#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ----------------------------------------------------------------------------------
# This module manages the patches for vLLM. There are two folders in this module:
# - platform: contains the patches applied before the worker starts. They are applied by
# `vllm_ascend.utils.adapt_patch(is_global_patch=True)` in the
# `vllm_ascend.platform.NPUPlatform.pre_register_and_update()` function.
# - worker: contains the patches applied when the worker starts. They are applied by
# `vllm_ascend.utils.adapt_patch(is_global_patch=False)` in
# each worker's `__init__` function.
#
# Once a new patch is added in vllm-ascend, please add the patch description into this file as well.
# ----------------------------------------------------------------------------------
# What's Patched and how it works:
# --------------------------------
# * Platform Patch:
# =================
# ** 1. File: platform/patch_distributed.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torch.distributed.all_reduce`, `torch.distributed.broadcast`
# Why:
# Tensor alignment is required on 310P devices.
# How:
# Rewrite `all_reduce` and `broadcast` in `torch.distributed` to align tensors.
# Related PR (if no, explain why):
# No, not ready yet.
# Future Plan:
# Find a better way to support tensor alignment for 310P without this patch.
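#
# A minimal sketch of the wrapper pattern (the 16-element padding unit below is
# a placeholder assumption, not the real 310P alignment rule):
#
#     import torch
#     import torch.distributed as dist
#
#     _orig_all_reduce = dist.all_reduce
#
#     def _aligned_all_reduce(tensor, *args, **kwargs):
#         pad = (-tensor.numel()) % 16  # hypothetical alignment unit
#         if pad == 0:
#             return _orig_all_reduce(tensor, *args, **kwargs)
#         buf = torch.zeros(tensor.numel() + pad, dtype=tensor.dtype,
#                           device=tensor.device)
#         buf[:tensor.numel()] = tensor.flatten()
#         _orig_all_reduce(buf, *args, **kwargs)
#         tensor.copy_(buf[:tensor.numel()].view_as(tensor))
#
#     dist.all_reduce = _aligned_all_reduce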
#
# ** 2. File: platform/patch_mamba_config.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.config.HybridAttentionMambaModelConfig.verify_and_update_config`
# Why:
# The block size is set to 16 in vLLM, which is not supported by Ascend.
# How:
# Set the block size to 128 on NPU.
# Related PR (if no, explain why):
# No, we'll fix this in vLLM soon.
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 3. File: platform/patch_multiproc_executor.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.executor.multiproc_executor.MultiprocExecutor`
# Why:
# vLLM creates child processes with daemon=True, which breaks the EPLB case: EPLB spawns
# a new process of its own, and daemon processes are not allowed to have children.
# How:
# Set daemon=False in MultiprocExecutor.
# Related PR (if no, explain why):
# No, we still need to find a way to support daemon=False in vLLM itself.
# Future Plan:
# Remove this patch when vLLM fixes the issue.
#
# ** 4. File: platform/patch_sched_yield.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.distributed.utils.USE_SCHED_YIELD`
# Why:
# os.sched_yield() doesn't work on Arm systems.
# How:
# Avoid using os.sched_yield() on Arm systems.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30228
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 5. File: platform/patch_balance_schedule.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.engine.core.EngineCoreProc.run_engine_core`
# `vllm.v1.core.sched.scheduler.Scheduler`
# Why:
# vLLM v1 scheduling currently enables chunked prefill by default, which processes prefill and decode
# requests simultaneously in a single scheduling step. This can hurt the overall system throughput
# and performance in some scenarios.
# How:
# Patch the engine core loop and scheduler to synchronize running-queue sizes across DP ranks.
# Enable it by setting the environment variable VLLM_ASCEND_BALANCE_SCHEDULING=1 in the startup script.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/29721
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 6. File: platform/patch_fusion_matcher_compat_ops.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torch.ops._C.rms_norm`, `torch.ops._C.fused_add_rms_norm`, and the other
# fusion-matcher ops (rotary embedding, static/dynamic fp8 quant, silu_and_mul)
# Why:
# Upstream vLLM initializes fusion-matcher global operators at import time.
# In the Ascend environment these symbols may be absent, causing an import failure.
# How:
# Inject placeholders only when the symbols are missing so the import can continue.
# Related PR (if no, explain why):
# No, this is a temporary compatibility patch until the upstream adjustment is merged.
# Future Plan:
# Remove this patch once upstream no longer requires these global symbols or
# provides a backend-safe initialization path.
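#
# A minimal sketch of the placeholder-injection pattern (the no-op fallback
# body is an assumption; the real patch may register proper dummies):
#
#     import torch
#
#     def _ensure_placeholder(name):
#         # torch.ops._C raises AttributeError for ops that were never
#         # registered, so hasattr() is a safe existence probe.
#         if not hasattr(torch.ops._C, name):
#             setattr(torch.ops._C, name, lambda *args, **kwargs: None)
#
#     for _op in ("rms_norm", "fused_add_rms_norm", "silu_and_mul"):
#         _ensure_placeholder(_op)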
#
# ** 7. File: platform/patch_minimax_m2_config.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.config.model.ModelConfig._verify_quantization`
# Why:
# MiniMax-M2 fp8 checkpoints on NPU may fail upstream quantization validation.
# vllm-ascend needs to disable fp8 quantization and load bf16 dequantized
# weights in worker-side patches instead.
# How:
# Monkey-patch `_verify_quantization` and intercept platform quantization
# verification to force `cfg.quantization=None` for MiniMax-M2 fp8 on NPU.
# Related PR (if no, explain why):
# No, upstream behavior differs across versions and needs discussion.
# Future Plan:
# Remove this patch once upstream supports MiniMax-M2 fp8 on NPU or provides
# a backend-safe validation / override mechanism.
#
# 2. `vllm.config.model.ModelConfig._verify_cuda_graph`
# Why:
# For MiniMax-M2 on NPU with ACL graph capture enabled, HCCL op expansion
# mode affects graph shape coverage. Users may forget to set it.
# How:
# If the user doesn't set it, set `HCCL_OP_EXPANSION_MODE=AIV` for this model
# and log a warning when a different value is detected.
# Related PR (if no, explain why):
# No, this is an environment-specific tuning knob.
# Future Plan:
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
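#
# A minimal sketch of the env handling described above (logger setup assumed):
#
#     import os
#     from vllm.logger import init_logger
#
#     logger = init_logger(__name__)
#
#     mode = os.environ.setdefault("HCCL_OP_EXPANSION_MODE", "AIV")
#     if mode != "AIV":
#         logger.warning(
#             "HCCL_OP_EXPANSION_MODE=%s; MiniMax-M2 ACL graph capture "
#             "expects AIV.", mode)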
#
# 3. `vllm.config.speculative.SpeculativeConfig._verify_args`
# Why:
# Upstream vLLM's eagle3/extract_hidden_states restricts target model types
# via a whitelist. MiniMax-M2 should be allowed once the worker-side model
# can emit auxiliary hidden states.
# How:
# Monkey-patch `_verify_args` to bypass only the whitelist ValueError for
# MiniMax model_type when method is eagle3/extract_hidden_states.
# SpeculativeConfig is a Pydantic dataclass (`@config`); init validation calls
# `__pydantic_decorators__.model_validators["_verify_args"].func`, so
# `Decorator.func` itself must be replaced (not only `SpeculativeConfig._verify_args`),
# followed by `rebuild_dataclass(SpeculativeConfig, force=True)`.
# If `VllmConfig` was imported earlier, also `rebuild_dataclass(VllmConfig, ...)`
# so nested `speculative_config` validation does not use a stale schema.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream whitelist includes MiniMax.
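#
# A condensed sketch of the Pydantic-aware replacement (the whitelist check is
# a hypothetical placeholder):
#
#     from pydantic.dataclasses import rebuild_dataclass
#     from vllm.config.speculative import SpeculativeConfig
#
#     _orig_verify = SpeculativeConfig._verify_args
#
#     def _patched_verify_args(self):
#         try:
#             return _orig_verify(self)
#         except ValueError:
#             if _is_minimax_eagle3_case(self):  # hypothetical check
#                 return self
#             raise
#
#     # Swap both the class attribute and the registered validator, then
#     # rebuild so the new validator is baked into the Pydantic schema.
#     validators = SpeculativeConfig.__pydantic_decorators__.model_validators
#     validators["_verify_args"].func = _patched_verify_args
#     SpeculativeConfig._verify_args = _patched_verify_args
#     rebuild_dataclass(SpeculativeConfig, force=True)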
#
# 4. `vllm.model_executor.models.registry` (spec decode aliases)
# Why:
# Some Eagle3 draft checkpoints may declare a MiniMax-specific architecture
# string while reusing the shared Eagle3 implementation.
# How:
# Register `Eagle3MiniMaxM2ForCausalLM` as an alias pointing to the
# existing Eagle3 implementation in the speculative decoding registry.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Drop the alias once upstream registry includes it or the checkpoint
# standardizes architecture strings.
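#
# The alias amounts to a standard registry call (the implementation module
# path below is an assumption):
#
#     from vllm import ModelRegistry
#
#     ModelRegistry.register_model(
#         "Eagle3MiniMaxM2ForCausalLM",
#         "vllm.model_executor.models.llama_eagle3:Eagle3LlamaForCausalLM",
#     )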
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
# Why:
# The default `MLAAttentionSpec` is mainly built around `kv_lora_rank`
# and `qk_rope_head_dim`. On NPU, we also use this class to describe DSA
# models. Unlike the GPU path, where cache management is handled by an
# additional indexer module, extending this class directly simplifies the
# corresponding `model_runner` implementation on NPU.
#
# This patch also adds Sparse C8 support for DSA models on NPU. As part
# of that support, members such as `page_size_bytes` need to be adapted,
# so they are overridden here as well to preserve overall readability.
# How:
# This patch subclasses the original implementation, overrides selected
# methods, and adds DSA-specific attributes and helpers with default
# values where needed.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/25896
# Future Plan:
# Remove this patch after the upcoming KV cache spec refactor.
#
# ** 9. File: platform/patch_minimax_usage_accounting.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.entrypoints.openai.chat_completion.serving.OpenAIServingChat`
# `vllm.entrypoints.openai.engine.protocol.UsageInfo`
# `vllm.reasoning.minimax_m2_reasoning_parser`
# Why:
# MiniMax M2 reasoning outputs use `</think>` as the only boundary token,
# but the runtime usage accounting path either omits reasoning token
# details entirely or counts them incorrectly.
# How:
# Monkey-patch the MiniMax reasoning token counters, extend `UsageInfo`
# with `completion_tokens_details.reasoning_tokens`, and update chat
# streaming/non-streaming usage generation to propagate the corrected
# counts.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37955
# Future Plan:
# Remove this patch once the upstream MiniMax usage-accounting fix is in
# the runtime vLLM version used by vllm-ascend.
#
# ** 10. File: platform/patch_glm_tool_call_parser.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.entrypoints.openai.chat_completion.serving.OpenAIServingChat`
# `vllm.tool_parsers.glm4_moe_tool_parser.Glm4MoeModelToolParser`
# Why:
# GLM-4.7 / GLM-4.5 tool-call streaming on the release runtime still has
# two independent finish-path bugs:
# 1. the parser can leave a terminal `<arg_value>... </tool_call>` chunk
# partially undrained, and
# 2. finish backfill trusts the parser's internal accumulated arguments
# instead of the argument bytes actually sent to the client.
# Together these can drop a full string value or emit only a suffix like
# `"}` in the final SSE chunk even when non-stream output is correct.
# How:
# Monkey-patch the GLM parser to keep draining a single chunk through
# terminal state transitions, and monkey-patch chat streaming to track
# per-tool arguments actually emitted to the client before computing the
# finish-chunk suffix. The suffix logic still tolerates mixed JSON
# whitespace styles from GLM tool parsers.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37845
# https://github.com/vllm-project/vllm/pull/33218
# Future Plan:
# Remove this patch once both the GLM parser drain fix and the serving
# finish-backfill fix are present in the runtime vLLM version used by
# vllm-ascend.
#
# * Worker Patch:
# ===============
#
# ** 1. File: worker/patch_distributed.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.distributed.parallel_state.GroupCoordinator`
# Why:
# vLLM doesn't support all_to_all on GroupCoordinator.
# How:
# Add all_to_all implementation for GroupCoordinator.
# Related PR (if no, explain why):
# No, we should use the vLLM all2all manager to support all_to_all on NPU.
# Future Plan:
# Remove this patch when the refactor of all2all manager is done.
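#
# A minimal sketch of the added method, assuming equal-sized splits:
#
#     import torch
#     import torch.distributed as dist
#     from vllm.distributed.parallel_state import GroupCoordinator
#
#     def all_to_all(self, input_, scatter_dim=0, gather_dim=0):
#         chunks = [t.contiguous()
#                   for t in input_.chunk(self.world_size, dim=scatter_dim)]
#         outputs = [torch.empty_like(t) for t in chunks]
#         dist.all_to_all(outputs, chunks, group=self.device_group)
#         return torch.cat(outputs, dim=gather_dim)
#
#     GroupCoordinator.all_to_all = all_to_all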
#
# ** 2. File: worker/patch_multimodal_merge.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.utils._merge_multimodal_embeddings`
# Why:
# The `_merge_multimodal_embeddings` function in vLLM is incompatible with Ascend.
# How:
# Replace it with a CPU operation that can be executed asynchronously.
# Related PR (if no, explain why):
# No, this is an Ascend-only bug and can't be fixed in vLLM.
# Future Plan:
# Identify this pattern in torch-npu and remove this patch.
#
# ** 3. File: worker/patch_bert.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.bert._encode_token_type_ids`
# `vllm.model_executor.models.bert._decode_token_type_ids`
# Why:
# The shift operations in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in Ascend aclgraph mode.
# How:
# Replace the shift operations with multiplication and division.
# Related PR (if no, explain why):
# No, this needs CANN to add an aclnn shift operation.
# Future Plan:
# Revert this when CANN supports the shift aclnn operation.
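#
# A sketch of the shift-free arithmetic (the packing scheme, with the
# token-type bit stored above the vocab bits, is assumed):
#
#     SHIFT = 30  # hypothetical bit position
#
#     def encode(token_ids, token_type_ids):
#         # originally: (token_type_ids << SHIFT) | token_ids
#         return token_type_ids * (1 << SHIFT) + token_ids
#
#     def decode(packed):
#         # originally: packed & ((1 << SHIFT) - 1) and packed >> SHIFT
#         return packed % (1 << SHIFT), packed // (1 << SHIFT)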
#
# ** 4. File: worker/patch_triton.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.ops`, `vllm.model_executor.layers.fla.ops`,
# `vllm.v1.worker.gpu.sample.gumbel.gumbel_sample`
# Why:
# The Triton ops in vLLM perform poorly on NPU, and there is no dispatch mechanism for Triton ops.
# How:
# Override the Triton ops in vLLM with Ascend implementations.
# Related PR (if no, explain why):
# No, vLLM needs to support Triton op dispatch first.
# Future Plan:
# Remove this patch when vLLM supports the dispatch mechanism.
#
# ** 5. File: worker/patch_qwen3_next_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.utils.bind_kv_cache`
# Why:
# The `bind_kv_cache` function raises an exception when current_platform is NPU.
# How:
# Replace it with a new bind_kv_cache that skips the raise.
# Related PR (if no, explain why):
# No, this needs discussion with the vLLM community.
# Future Plan:
# Remove this patch after discussing with vllm community and adapting bind_kv_cache to npu.
#
# ** 6. File: worker/patch_rejection_sampler.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.sample.rejection_sampler`
# Why:
# Some functions from `rejection_sampler` are not supported or slow on NPU.
# How:
# - Add npu_top_k_top_p to the `apply_sampling_constraints` function.
# - Add custom Triton kernels to `expand_batch_to_tokens` and `rejection_sample`.
# Related PR (if no, explain why):
# No, vLLM needs to support Triton op dispatch first.
# Future Plan:
# 1. Make these functions class methods of RejectionSampler, create AscendRejectionSampler
# to override them, then delete the patch file `worker/patch_rejection_sampler.py`.
# 2. Make these functions custom ops, then remove AscendRejectionSampler.
#
# ** 7. File: worker/patch_module.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.attention.backends.gdn_attn.torch.argsort`
# Why:
# 1. The `torch.argsort` function on NPU does not support bool inputs.
# 2. Without `stable=True`, the output contains many redundant tokens.
# How:
# Replace it with a wrapper around torch.argsort that casts the input to torch.int32
# and performs a stable sort.
# Related PR (if no, explain why):
# 1. It depends on torch_npu.
# 2. https://github.com/vllm-project/vllm/pull/30632
# Future Plan:
# Remove this patch when bool is supported by `torch.argsort` on NPU and
# `torch.argsort` in `vllm.v1.attention.backends.gdn_attn` is stable.
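#
# A sketch of the wrapper (the real patch scopes the replacement to the
# gdn_attn module rather than patching torch globally):
#
#     import torch
#
#     _orig_argsort = torch.argsort
#
#     def _npu_safe_argsort(input, *args, **kwargs):
#         if input.dtype == torch.bool:
#             input = input.to(torch.int32)
#         kwargs.setdefault("stable", True)
#         return _orig_argsort(input, *args, **kwargs)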
#
# ** 7a. File: worker/patch_gdn_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.attention.backends.gdn_attn.GDNAttentionMetadataBuilder.build`
# Why:
# Qwen3.5/Qwen3Next GDN prefill on NPU needs prebuilt varlen chunk metadata
# to avoid forward-time host round-trips that stall async scheduling.
# How:
# Monkey-patch the upstream builder in place, keep the upstream code untouched,
# and attach the prebuilt device-metadata bundle to the returned attention
# metadata object for Ascend-specific consumers.
# Future Plan:
# Remove this patch when upstream exposes a backend hook for extending GDN
# metadata or when the optimization is accepted upstream directly.
#
# ** 8. File: worker/patch_qwen3_next.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet.forward`
# Why:
# Custom operators cannot be added directly to Qwen3NextGatedDeltaNet.forward.
# How:
# Add a branch in Qwen3NextGatedDeltaNet.forward to adapt to fused_qkvzba_split_reshape_cat.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30863
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# 2. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
# Why:
# The Triton ops fused_recurrent_gated_delta_rule and fused_gdn_gating in vLLM perform poorly on NPU.
# How:
# Add new fused Triton ops with an Ascend implementation.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/30860
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# 3. `vllm.model_executor.models.qwen3_next.Qwen3NextGatedDeltaNet._forward_core`
# Why:
# Custom operators cannot be added directly to Qwen3NextGatedDeltaNet._forward_core.
# How:
# Add a branch in Qwen3NextGatedDeltaNet._forward_core to adapt to fused_gdn_gating_patch.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/31002
# Future Plan:
# Remove this patch when vLLM supports these operators.
#
# ** 9. File: worker/patch_huanyuan_vl.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.transformers_utils.processors.hunyuan_vl.HunYuanVLProcessor.__call__`
# Why:
# The `add_special_tokens` parameter is not supported by default in the processor.
# How:
# Remove the `add_special_tokens` parameter from kwargs before calling the original method.
# Future Plan:
# Remove this patch when vLLM aligns with the latest processor implementation.
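#
# The patch is essentially a thin kwargs filter, roughly:
#
#     from vllm.transformers_utils.processors.hunyuan_vl import HunYuanVLProcessor
#
#     _orig_call = HunYuanVLProcessor.__call__
#
#     def _call(self, *args, **kwargs):
#         kwargs.pop("add_special_tokens", None)  # not supported by the processor
#         return _orig_call(self, *args, **kwargs)
#
#     HunYuanVLProcessor.__call__ = _call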
#
# ** 10. File: worker/patch_v2/patch_eagle.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.spec_decode.eagle.EagleSpeculator.propose`
# Why:
# The `propose` method uses torch.gather, but the gather operator pollutes
# the arguments passed to it. The bug has been reported to the Huawei
# CANN team but is not fixed yet.
# How:
# Clone the `out` tensor ahead of the gather to avoid the bug.
# Future Plan:
# Remove this patch when CANN fixes the gather bug.
#
# ** 11. File: worker/patch_unquantized_gemm.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.utils.default_unquantized_gemm`
# Why:
# unquantized_gemm in vLLM is not a CustomOp, which makes the MatmulAllReduceAddRMSNorm fusion fail.
# How:
# Make unquantized_gemm a CustomOp.
# Future Plan:
# Remove this patch when vLLM supports the operator as a CustomOp.
#
# ** 12. File: worker/patch_npugraph_ex_triton.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `torchair.core._concrete_graph.ValuePack`,
# `torchair.npu_fx_compiler._unpack_meta`,
# `torchair.npu_fx_compiler._NpuGraphConverter._unpack_npu`
# Why:
# In the Triton scenario, npugraph_ex backend needs to process the value pack of the input parameters.
# How:
# Add the relevant handling logic via patches.
# Related PR (if no, explain why):
# https://gitcode.com/Ascend/torchair/pull/2575
# Future Plan:
# Remove this patch when the PTA version used by vllm-ascend has been upgraded.
#
# ** 13. File: worker/patch_v2/patch_uva.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.states.UvaBuffer`
# Why:
# Ascend NPUs do not support UVA yet, so we need to stub it out in vLLM.
# How:
# Make UvaBuffer a dummy class that mimics the interface of vLLM's UvaBuffer.
# Future Plan:
# Remove this patch when NPU supports UVA.
#
# ** 14. File: worker/patch_kimi_k25.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.kimi_k25_vit.Learnable2DInterpPosEmbDivided_fixed.forward`
# Why:
# The forward method uses interpolate with ops not supported on NPU.
# How:
# Replace it with a new forward that falls back to the CPU for interpolation on shape
# mismatch and uses get_rope_shape to handle the rope shape interpolation.
# Future Plan:
# Remove this patch when vLLM aligns with the latest main.
#
# ** 15. File: worker/patch_routed_experts_capturer.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.fused_moe.routed_experts_capturer.RoutedExpertsCapturer.init_buffer`
# Why:
# The `_device_buffer` initialization in vLLM hardcodes `device="cuda"`,
# which doesn't work on NPU.
# How:
# Replace `device="cuda"` with `device=current_platform.device_name` to support NPU.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/34336
# Future Plan:
# Remove this patch when vLLM merges the PR.
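#
# The fix itself is the usual portable-device idiom, e.g.:
#
#     import torch
#     from vllm.platforms import current_platform
#
#     def _alloc(shape, dtype):
#         return torch.zeros(shape, dtype=dtype,
#                            device=current_platform.device_name)
#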
# ** 16. File: worker/patch_draft_quarot.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.llama_eagle3.Eagle3LlamaForCausalLM.load_weights`
# Why:
# vllm-ascend reuses the drafter-model loading logic from vLLM,
# but the upstream logic does not apply Ascend quantization.
# How:
# Dynamically replace the `load_weights` function at runtime,
# binding `target_config` into the new implementation with a closure.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/36225
# Future Plan:
# Remove this patch when vLLM merges the PR.
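#
# A sketch of the closure-binding pattern (the loading body is elided;
# `target_config` comes from the speculative config at patch time):
#
#     from vllm.model_executor.models.llama_eagle3 import Eagle3LlamaForCausalLM
#
#     def _make_load_weights(target_config):
#         def load_weights(self, weights):
#             # quantization-aware loading; target_config is captured by the
#             # closure instead of being threaded through vLLM interfaces
#             ...
#         return load_weights
#
#     Eagle3LlamaForCausalLM.load_weights = _make_load_weights(target_config)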
#
# ** 17. File: worker/patch_minimax_m2.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.minimax_m2.MiniMaxM2MoE.forward`
# Why:
# In TP mode, MiniMax-M2 MoE needs a backend-aware reduction path to avoid
# unnecessary communication / maintain correctness on NPU.
# How:
# Replace the forward to call `experts.maybe_all_reduce_tensor_model_parallel`
# when `tp_size > 1`.
# Related PR (if no, explain why):
# No, model-specific behavior.
# Future Plan:
# Move this behavior upstream once a generic MoE reduce hook exists.
#
# 2. `vllm.model_executor.models.minimax_m2.MiniMaxM2Attention.__init__`
# Why:
# When total kv heads < TP world size, kv head replication happens and k_norm
# weights should be sharded to match the replication layout.
# How:
# Add `num_kv_head_replicas` and create sharded `k_norm` via
# `MiniMaxText01RMSNormTP(..., weight_shard_world_size=total_num_kv_heads, ...)`.
# Related PR (if no, explain why):
# No, depends on Ascend kernel behavior and TP layout.
# Future Plan:
# Remove this patch if upstream implements kv-head-aware norm sharding.
#
# 3. `vllm.model_executor.models.minimax_m2.MiniMaxM2Model.load_weights`
# Why:
# MiniMax-M2 fp8 checkpoints may store fp8 weights with per-block inverse
# scales. On NPU we load bf16 weights by dequantizing at load time.
# How:
# Inject fp8 dequant helpers and wrap `load_weights` to convert fp8 weight +
# `weight_scale_inv` pairs into bf16 blocks before delegating to upstream.
# Related PR (if no, explain why):
# No, fp8 load format and backend constraints are model/backend specific.
# Future Plan:
# Remove this patch when upstream supports MiniMax-M2 fp8 loading on NPU.
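#
# A sketch of the load-time dequant, assuming a 128x128 block scale layout:
#
#     import torch
#
#     def dequant_block_fp8(weight_fp8, weight_scale_inv, block=128):
#         w = weight_fp8.to(torch.bfloat16)
#         # expand per-block scales to elementwise, trimming partial blocks
#         s = weight_scale_inv.repeat_interleave(block, dim=0)[: w.shape[0]]
#         s = s.repeat_interleave(block, dim=1)[:, : w.shape[1]]
#         return w * s.to(torch.bfloat16)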
#
# 4. `vllm.model_executor.models.minimax_m2.MiniMaxM2Model.forward`
# Why:
# Eagle3 speculative decoding needs auxiliary hidden states from specific
# transformer layers of the target model.
# How:
# Extend `MiniMaxM2Model.forward` to optionally collect and return
# `(final_hidden_states, aux_hidden_states)` when `aux_hidden_state_layers`
# is set by the runtime.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream MiniMax-M2 integrates Eagle3 support.
#
# 5. `vllm.model_executor.models.minimax_m2.MiniMaxM2ForCausalLM`
# Why:
# vLLM core uses SupportsEagle3-style methods to configure which layers
# should emit auxiliary hidden states.
# How:
# Inject `set_aux_hidden_state_layers` and default-layer getters onto
# `MiniMaxM2ForCausalLM` so vLLM can configure the target model.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream provides these methods on the model.
#
# ** 18. File: worker/patch_minimax_m2_linear_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.__init__`
# `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.weight_loader`
# Why:
# MiniMax-M2 linear attention RMSNorm needs weight sharding that can follow
# TP layout (and sometimes kv-head replication) on NPU.
# How:
# Override `__init__` to parameterize weight shard world/rank and install a
# sharded `weight_loader` implementation.
# Related PR (if no, explain why):
# No, upstream API surface differs across versions.
# Future Plan:
# Remove this patch when upstream exposes stable sharding hooks for this layer.
#
# 2. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.forward_qk`
# (or older `_normalize_qk`)
# Why:
# q/k norm for linear attention is performance-sensitive. On NPU, a fused
# rms_norm kernel is faster and TP needs a global rstd correction.
# How:
# Replace q/k normalization with NPU rms_norm fast path and TP-global rstd
# correction; fall back to upstream implementation on non-NPU.
# Related PR (if no, explain why):
# No, backend-specific optimization.
# Future Plan:
# Remove this patch when upstream adds a backend dispatch path for q/k norm.
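#
# The TP-global rstd correction is conceptually as below (the real patch uses
# a fused NPU rms_norm kernel instead of this eager math):
#
#     import torch
#     from vllm.distributed import tensor_model_parallel_all_reduce
#
#     def _tp_rms_norm(x_local, weight_local, full_dim, eps=1e-6):
#         # each rank holds a shard of the hidden dim; the variance must be
#         # computed over the full dim, hence the all-reduce
#         local_sq = x_local.float().pow(2).sum(-1, keepdim=True)
#         global_sq = tensor_model_parallel_all_reduce(local_sq)
#         rstd = torch.rsqrt(global_sq / full_dim + eps)
#         return (x_local.float() * rstd).to(x_local.dtype) * weight_local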
#
# ** 19. File: worker/patch_qwen3_5.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.qwen3_5.Qwen3_5GatedDeltaNet._forward_core`
# Why:
# The class Qwen3_5GatedDeltaNet reuses the `_forward_core` method of Qwen3NextGatedDeltaNet,
# but the AscendC ops used by Qwen3NextGatedDeltaNet do not support ssm_state in float32 format.
# How:
# Patch Qwen3_5GatedDeltaNet._forward_core to use Triton ops such as `fused_recurrent_gated_delta_rule`.
# Future Plan:
# Remove this patch when all ops in _forward_core support both Qwen3_5 and Qwen3Next.
#
# ** 20. File: worker/patch_cudagraph.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.cudagraph_dispatcher.CudagraphDispatcher._create_padded_batch_descriptor`
# Why:
# vLLM's FULL cudagraph mode raises an assertion error during speculative decoding when
# `num_spec + 1` is not in `cudagraph_capture_sizes`; this patch avoids the error so FULL
# mode can be enabled (with the drafter running eagerly).
# How:
# Dynamically replace the `_create_padded_batch_descriptor` function at runtime,
# and relax the offending condition.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/34880
# Future Plan:
# Remove this patch when vLLM merges the PR.
#
# ** 21. File: worker/patch_deepseek_mtp.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.deepseek_v2.get_spec_layer_idx_from_weight_name` and
# `vllm.model_executor.models.deepseek_mtp.get_spec_layer_idx_from_weight_name`
# Why:
# When GLM5 uses rotary quant in vllm-ascend, the MTP layer needs to load an extra weight
# named `rot.weight`.
# How:
# If the weight name starts with `rot`, return `layer_id + i` as for the other tensors in the MTP layer.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
# 2. `vllm.model_executor.models.deepseek_mtp.DeepSeekMultiTokenPredictorLayer`
# Why:
# When GLM5 uses rotary quant in vllm-ascend, `previous_hidden_states` cannot be fed to `ehnorm` directly.
# How:
# If the target model uses rotary quant, a new linear operation is added before `ehnorm`.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
# 3. `vllm.model_executor.models.deepseek_mtp.DeepSeekMTP._rewrite_spec_layer_name`
# Why:
# Rename `rot.weight` to match the format of weights in `DeepSeekMTP`.
# How:
# If the weight name is `rot`, rename it to `model.layers.{spec_layer}.rot.weight`.
# Related PR (if no, explain why):
# Rotary quant is a unique feature of vllm-ascend.
# Future Plan:
# Remove this patch when vllm supports rotary quant or pluggable `MultiTokenPredictorLayer`.
#
# ** 22. File: worker/patch_mamba_utils.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.mamba_utils.batch_memcpy_kernel`
# Why:
# The original batch_memcpy_kernel implemented in vLLM might encounter bugs when running on
# Ascend hardware.
# How:
# Patch it to fix the related bugs.
# Future Plan:
# Remove this patch when:
# (1) the original batch_memcpy_kernel can run on Ascend hardware,
# or
# (2) a dispatch mechanism for batch_memcpy_kernel is designed.
# 2. `vllm.v1.worker.mamba_utils.batch_memcpy`
# Why:
# vLLM uses BLOCK_SIZE=1024 for batch_memcpy_kernel, which results in suboptimal performance
# on Ascend hardware.
# How:
# Patch it to change BLOCK_SIZE to 8192.
# Future Plan:
# Remove this patch when a dispatch mechanism for batch_memcpy_kernel is designed.
#
# ** 23. File: worker/patch_weight_utils.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.deepseek_v2.DeepseekV2ForCausalLM.load_weights`
# Why:
# The C8 weights quantized by ModelSlim modify the model structure, adding the scale
# and offset parameters required for KV cache quantization. In addition, the names of
# the quantization parameters differ from those used in the community.
# How:
# Enhance the maybe_remap_kv_scale_name function to remap these names.
# Future Plan:
# Rework the community maybe_remap_kv_scale_name function to support multiple
# backends, then remove this patch.
#
# ** 24. File: worker/patch_v2/patch_input_batch.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.input_batch.InputBatch`
# Why:
# vLLM uses InputBatch to make dummy tensors in `model_runner.py` and `cudagraph_utils.py`,
# which makes it difficult to inherit from vLLM methods.
# How:
# Replace InputBatch with AscendInputBatch.
# Future Plan:
# Remove this patch when vllm-ascend's make_dummy behavior aligns with vLLM.
#
# ** 25. File: worker/patch_v2/patch_block_table.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.block_table.BlockTables`
# Why:
# vllm-ascend needs to initialize the slot mapping with torch.int32 dtype,
# but the vLLM default is torch.int64.
# How:
# Replace BlockTables with AscendBlockTables, which initializes the slot mapping
# with torch.int32 dtype.
# Future Plan:
# Remove this patch when vllm-ascend's BlockTables can initialize the slot mapping
# with torch.int64 dtype.
#
# ** 26. File: worker/patch_v2/patch_model_state.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.model_states.default.init_model_state`
# Why:
# vllm-ascend's prepare_attn in ModelState differs from vLLM's,
# so we need to override init_model_state.
# How:
# Define AscendModelState and initialize it in init_model_state.
# Future Plan:
# Remove this patch when vllm-ascend's attention metadata aligns with vLLM.
#
# ** 27. File: worker/patch_v2/patch_triton.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.worker.gpu.sample.logprob`, `vllm.v1.worker.gpu.sample.penalties.apply_penalties`,
# `vllm.v1.worker.gpu.sample.gumbel.gumbel_sample`
# Why:
# The Triton ops in vLLM perform poorly on NPU, and there is no dispatch mechanism for Triton ops.
# How:
# Override the Triton ops in vLLM with Ascend implementations.
# Related PR (if no, explain why):
# No, vLLM needs to support Triton op dispatch first.
# Future Plan:
# Remove this patch when vLLM supports the dispatch mechanism.
#