### What this PR does / why we need it?
Apply Eagle3 speculative decoding to MiniMax-M2.5 to improve model
performance. These changes will be discarded once the Eagle3 weights for
MiniMax-M2.5 are released and the code change is accepted by the official
repo:
https://github.com/vllm-project/vllm/pull/37512/changes

backport: #7619
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
@@ -137,6 +137,38 @@
# Remove this patch if upstream provides an official NPU graph-capture
# guidance / auto-configuration path for HCCL.
#
# 3. `vllm.config.speculative.SpeculativeConfig._verify_args`
# Why:
# Upstream vLLM's eagle3/extract_hidden_states restricts target model types
# via a whitelist. MiniMax-M2 should be allowed once the worker-side model
# can emit auxiliary hidden states.
# How:
# Monkey-patch `_verify_args` to bypass only the whitelist ValueError for
# MiniMax model_type when method is eagle3/extract_hidden_states.
# SpeculativeConfig is a Pydantic dataclass (`@config`); init validation calls
# `__pydantic_decorators__.model_validators["_verify_args"].func`, so
# `Decorator.func` must be replaced (not only `SpeculativeConfig._verify_args`),
# then `rebuild_dataclass(SpeculativeConfig, force=True)`.
# If `VllmConfig` was imported earlier, also `rebuild_dataclass(VllmConfig, ...)`
# so nested `speculative_config` validation does not use a stale schema.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream whitelist includes MiniMax.
#
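The validator replacement above can be sketched with a stdlib-only stand-in. `SpecConfig` and its `_validators` table below only mimic how Pydantic invokes a model validator through a stored function reference (the `Decorator.func` the patch must replace); none of the names here are vLLM's or Pydantic's real API, and the real patch additionally calls `rebuild_dataclass(...)` as described.

```python
# Stdlib-only analogue: the validator is called via a stored function
# reference, so rebinding the method on the class alone would not help.
class SpecConfig:
    # Mimics Pydantic's decorator table keeping a direct reference to the
    # validator function at class-build time.
    _validators = {}

    def _verify_args(self):
        # Restrictive whitelist, like upstream's target-model check.
        if self.model_type not in ("llama",):
            raise ValueError(f"unsupported model_type {self.model_type!r}")

    def __init__(self, model_type):
        self.model_type = model_type
        # Validation goes through the stored reference, not `self._verify_args()`.
        type(self)._validators["_verify_args"](self)


SpecConfig._validators["_verify_args"] = SpecConfig._verify_args

_orig_verify = SpecConfig._validators["_verify_args"]


def _patched_verify(self):
    # Bypass only the whitelist error for the MiniMax case; everything
    # else still runs through the original validator.
    if self.model_type == "minimax_m2":
        return
    _orig_verify(self)


# Replace the *stored* function (the analogue of `Decorator.func`); in real
# Pydantic this is followed by `rebuild_dataclass(SpecConfig, force=True)`.
SpecConfig._validators["_verify_args"] = _patched_verify

cfg = SpecConfig("minimax_m2")  # now accepted
```

Patching only `SpecConfig._verify_args` would leave the stored reference pointing at the old function, which is exactly why the real patch must swap `Decorator.func` and rebuild the dataclass schema.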
# 4. `vllm.model_executor.models.registry` (spec decode aliases)
# Why:
# Some Eagle3 draft checkpoints may declare a MiniMax-specific architecture
# string while reusing the shared Eagle3 implementation.
# How:
# Register `Eagle3MiniMaxM2ForCausalLM` as an alias pointing to the
# existing Eagle3 implementation in the speculative decoding registry.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Drop the alias once upstream registry includes it or the checkpoint
# standardizes architecture strings.
#
# ** 8. File: platform/patch_kv_cache_interface.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.v1.kv_cache_interface.MLAAttentionSpec`
@@ -453,6 +485,31 @@
# Future Plan:
# Remove this patch when upstream supports MiniMax-M2 fp8 loading on NPU.
#
# 4. `vllm.model_executor.models.minimax_m2.MiniMaxM2Model.forward`
# Why:
# Eagle3 speculative decoding needs auxiliary hidden states from specific
# transformer layers of the target model.
# How:
# Extend `MiniMaxM2Model.forward` to optionally collect and return
# `(final_hidden_states, aux_hidden_states)` when `aux_hidden_state_layers`
# is set by the runtime.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream MiniMax-M2 integrates Eagle3 support.
#
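The forward extension can be sketched with plain Python in place of torch tensors; `TinyModel`/`TinyLayer` are simplified stand-ins for the MiniMax-M2 model, and only the control flow (collect inputs of selected layers, return a tuple when configured) mirrors the patch.

```python
# Stand-in layer: multiplies every element of the hidden state by a scale.
class TinyLayer:
    def __init__(self, scale):
        self.scale = scale

    def __call__(self, hidden_states):
        return [x * self.scale for x in hidden_states]


class TinyModel:
    def __init__(self):
        self.layers = [TinyLayer(2), TinyLayer(3), TinyLayer(5)]
        # Set by the runtime; when None, forward behaves as before the patch.
        self.aux_hidden_state_layers = None

    def forward(self, hidden_states):
        aux_hidden_states = []
        for idx, layer in enumerate(self.layers):
            if (self.aux_hidden_state_layers is not None
                    and idx in self.aux_hidden_state_layers):
                # Capture the input to this layer as an auxiliary state.
                aux_hidden_states.append(hidden_states)
            hidden_states = layer(hidden_states)
        if self.aux_hidden_state_layers is None:
            return hidden_states  # unchanged original behavior
        return hidden_states, aux_hidden_states


model = TinyModel()
model.aux_hidden_state_layers = (1, 2)
final, aux = model.forward([1.0])
```

When `aux_hidden_state_layers` is unset, the return type is identical to the unpatched model, which is what keeps the patch backward compatible.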
# 5. `vllm.model_executor.models.minimax_m2.MiniMaxM2ForCausalLM`
# Why:
# vLLM core uses SupportsEagle3-style methods to configure which layers
# should emit auxiliary hidden states.
# How:
# Inject `set_aux_hidden_state_layers` and default-layer getters onto
# `MiniMaxM2ForCausalLM` so vLLM can configure the target model.
# Related PR (if no, explain why):
# https://github.com/vllm-project/vllm/pull/37512
# Future Plan:
# Remove this patch once upstream provides these methods on the model.
#
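The method injection can be sketched as follows; `TargetModel` stands in for `MiniMaxM2ForCausalLM`, the setter name follows the patch description, and the default layer indices returned by the getter are purely illustrative.

```python
# Stand-in for the target model class being patched at import time.
class TargetModel:
    def __init__(self):
        # Inner transformer holding the aux-layer configuration, playing
        # the role of `MiniMaxM2Model` in the real patch.
        self.model = type("Inner", (), {"aux_hidden_state_layers": None})()


def set_aux_hidden_state_layers(self, layers):
    # Forward the runtime's choice of layers to the inner transformer.
    self.model.aux_hidden_state_layers = tuple(layers)


def get_eagle3_aux_hidden_state_layers(self):
    # Illustrative default: a few taps spread across the network depth.
    return (1, 2, 3)


# Monkey-patch: attach the SupportsEagle3-style hooks so the engine can
# configure the target model without the class declaring them itself.
TargetModel.set_aux_hidden_state_layers = set_aux_hidden_state_layers
TargetModel.get_eagle3_aux_hidden_state_layers = get_eagle3_aux_hidden_state_layers

m = TargetModel()
m.set_aux_hidden_state_layers(m.get_eagle3_aux_hidden_state_layers())
```

Because plain functions assigned to a class become bound methods on instances, injection after class definition is enough for callers that look the methods up by name.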
# ** 18. File: worker/patch_minimax_m2_linear_attn.py**
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.layers.mamba.linear_attn.MiniMaxText01RMSNormTP.__init__`