[v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619) (#7714)

### What this PR does / why we need it?
Apply Eagle3 to MiniMax-M2.5 to improve model performance. This change will be
discarded once the Eagle3 weights for MiniMax-M2.5 are released and the code
change is accepted by the official repo:
https://github.com/vllm-project/vllm/pull/37512/changes
backport: #7619

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
This commit is contained in:
SparrowMu
2026-03-27 18:33:29 +08:00
committed by GitHub
parent 60e88d9541
commit 6fbd0049df
3 changed files with 312 additions and 24 deletions


@@ -137,6 +137,38 @@
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# 3. `vllm.config.speculative.SpeculativeConfig._verify_args`
# Why:
# Upstream vLLM's eagle3/extract_hidden_states restricts target model types
# via a whitelist. MiniMax-M2 should be allowed once the worker-side model
# can emit auxiliary hidden states.
# How:
# Monkey-patch `_verify_args` to bypass only the whitelist ValueError for
# MiniMax model_type when method is eagle3/extract_hidden_states.
# SpeculativeConfig is a Pydantic dataclass (`@config`); init validation calls
# `__pydantic_decorators__.model_validators["_verify_args"].func`, so that
# `Decorator.func` must be replaced (not only `SpeculativeConfig._verify_args`),
# then `rebuild_dataclass(SpeculativeConfig, force=True)`.
# If `VllmConfig` was imported earlier, also `rebuild_dataclass(VllmConfig, ...)`
# so nested `speculative_config` validation does not use a stale schema.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream whitelist includes MiniMax.
#
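The whitelist bypass in item 3 can be sketched as follows. This is a hypothetical, simplified stand-in: the real patch operates on a pydantic dataclass through `__pydantic_decorators__` and `rebuild_dataclass` as noted above, while this plain class illustrates only the "swallow exactly one ValueError" idea; the class body, whitelist contents, and model-type strings are invented for illustration.

```python
_ALLOWED_TARGETS = {"llama", "qwen3"}  # stand-in for the upstream whitelist

class SpeculativeConfig:  # simplified stand-in, not the real pydantic dataclass
    def __init__(self, method: str, target_model_type: str):
        self.method = method
        self.target_model_type = target_model_type
        self._verify_args()

    def _verify_args(self):
        if (self.method == "eagle3"
                and self.target_model_type not in _ALLOWED_TARGETS):
            raise ValueError(
                f"eagle3 not supported for {self.target_model_type}")

_orig_verify_args = SpeculativeConfig._verify_args

def _patched_verify_args(self):
    try:
        _orig_verify_args(self)
    except ValueError:
        # Swallow only the whitelist error for MiniMax + eagle3;
        # every other validation failure still propagates.
        if self.method == "eagle3" and self.target_model_type == "minimax_m2":
            return
        raise

SpeculativeConfig._verify_args = _patched_verify_args

cfg = SpeculativeConfig("eagle3", "minimax_m2")  # accepted after the patch
```

Note that re-raising anything that is not the specific whitelist case keeps all other validation intact, which is why the patch is narrower than simply replacing `_verify_args` with a no-op.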
# 4. `vllm.model_executor.models.registry` (spec decode aliases)
# Why:
# Some Eagle3 draft checkpoints may declare a MiniMax-specific architecture
# string while reusing the shared Eagle3 implementation.
# How:
# Register `Eagle3MiniMaxM2ForCausalLM` as an alias pointing to the
# existing Eagle3 implementation in the speculative decoding registry.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Drop the alias once upstream registry includes it or the checkpoint
# standardizes architecture strings.
#
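The alias registration in item 4 amounts to pointing a second architecture string at the same implementation. A minimal sketch, assuming a dict-shaped registry (the real vLLM registry maps architecture strings to lazily-imported module/class pairs, so this shape is illustrative only):

```python
# Stand-in for the speculative-decoding model registry:
# architecture string -> (module name, class name).
SPEC_DECODE_REGISTRY = {
    "Eagle3LlamaForCausalLM": ("llama_eagle3", "Eagle3LlamaForCausalLM"),
}

# Alias: a MiniMax-specific architecture string declared by some Eagle3
# draft checkpoints reuses the shared Eagle3 implementation.
SPEC_DECODE_REGISTRY["Eagle3MiniMaxM2ForCausalLM"] = (
    SPEC_DECODE_REGISTRY["Eagle3LlamaForCausalLM"])
```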
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
@@ -453,6 +485,31 @@
# Future Plan:
# Remove this patch when upstream supports MiniMax-M2 fp8 loading on NPU.
#
# 4. `vllm.model_executor.models.minimax_m2.MiniMaxM2Model.forward`
# Why:
# Eagle3 speculative decoding needs auxiliary hidden states from specific
# transformer layers of the target model.
# How:
# Extend `MiniMaxM2Model.forward` to optionally collect and return
# `(final_hidden_states, aux_hidden_states)` when `aux_hidden_state_layers`
# is set by the runtime.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream MiniMax-M2 integrates Eagle3 support.
#
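The optional collection pattern in item 4 can be sketched like this. The per-layer computation below is a toy placeholder (the real forward runs MiniMax-M2 transformer layers on tensors); only the capture-and-dual-return shape reflects the description above.

```python
from typing import Callable, Optional, Sequence

def forward(hidden_states,
            layers: Sequence[Callable],
            aux_hidden_state_layers: Optional[Sequence[int]] = None):
    aux_hidden_states = []
    for idx, layer in enumerate(layers):
        if aux_hidden_state_layers and idx in aux_hidden_state_layers:
            # Capture the hidden state entering this layer for the draft model.
            aux_hidden_states.append(hidden_states)
        hidden_states = layer(hidden_states)
    if aux_hidden_state_layers:
        # Eagle3 path: return both final and auxiliary hidden states.
        return hidden_states, aux_hidden_states
    return hidden_states  # unchanged behavior when the runtime sets nothing

# Toy "layers": each one just increments the state.
out, aux = forward(0, [lambda x: x + 1] * 4, aux_hidden_state_layers=(1, 3))
```

Returning the plain final state when `aux_hidden_state_layers` is unset is what keeps the patch backward compatible with callers that expect a single value.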
# 5. `vllm.model_executor.models.minimax_m2.MiniMaxM2ForCausalLM`
# Why:
# vLLM core uses SupportsEagle3-style methods to configure which layers
# should emit auxiliary hidden states.
# How:
# Inject `set_aux_hidden_state_layers` and default-layer getters onto
# `MiniMaxM2ForCausalLM` so vLLM can configure the target model.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream provides these methods on the model.
#
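The injection in item 5 is plain attribute assignment on the model class. A hedged sketch, with the class body and the default-layer choice invented for illustration (the real defaults come from the model config, not a hard-coded layer count):

```python
class MiniMaxM2ForCausalLM:  # stand-in for the real vLLM model class
    pass

def set_aux_hidden_state_layers(self, layers):
    # Record which transformer layers should emit auxiliary hidden states.
    self.aux_hidden_state_layers = tuple(layers)

def get_eagle3_aux_hidden_state_layers(self):
    # Hypothetical default: an early, a middle, and a late layer.
    num_layers = 24  # placeholder for the real config value
    return (2, num_layers // 2, num_layers - 3)

# Inject the SupportsEagle3-style hooks so vLLM core can configure
# the target model without the model class declaring them itself.
MiniMaxM2ForCausalLM.set_aux_hidden_state_layers = set_aux_hidden_state_layers
MiniMaxM2ForCausalLM.get_eagle3_aux_hidden_state_layers = (
    get_eagle3_aux_hidden_state_layers)

model = MiniMaxM2ForCausalLM()
model.set_aux_hidden_state_layers(model.get_eagle3_aux_hidden_state_layers())
```

Because Python resolves method calls through the class at call time, instances created before or after the injection both pick up the new hooks.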
# ** 18. File: worker/patch_minimax_m2_linear_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.__init__`