[Doc] Add new intro to MiniMax-M2.5/M2.7 (#8169)
### What this PR does / why we need it?

1. This PR cherry-picks the commit that contains the current best performance at 3.5k/1.5k and 128k/1k from main to the 0.18.0 branch.
2. This PR introduces MiniMax-M2.7 Day-0 information to users.
3. To finish the previous step, we also rename the MiniMax doc from MiniMax-M2.5.md to MiniMax-M2.md.

---------

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
@@ -1,25 +1,37 @@
-# MiniMax-M2.5
+# MiniMax-M2

## Introduction

MiniMax-M2.5 is MiniMax's flagship large language model, reinforced for high-value scenarios such as code generation, agentic tool calling/search, and complex office workflows, with an emphasis on reasoning efficiency and end-to-end speed on challenging tasks.

-This document provides a unified deployment guide for `MiniMax-M2.5` on vLLM Ascend, covering both:
+MiniMax-M2.7 is MiniMax's first model to participate deeply in its own evolution. M2.7 is capable of building complex agent harnesses and completing highly elaborate productivity tasks, leveraging Agent Teams, complex Skills, and dynamic tool search.
+
+This document provides a unified deployment guide for `MiniMax-M2.5` and `MiniMax-M2.7` on vLLM Ascend, covering both:

- **A3 single-node** deployment (Atlas 800 A3)
- **A2 dual-node** deployment (2× Atlas 800I A2)
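Before choosing a deployment mode, you can confirm how many NPUs each node exposes; a minimal check, assuming the standard Ascend `npu-smi` tool is available in your environment:

```{code-block} bash
# List the NPUs visible on this node; per the configurations in this guide,
# an Atlas 800 A3 node exposes 16 devices and each Atlas 800I A2 node exposes 8
npu-smi info
```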
## Supported Features

Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.

Refer to [feature guide](../../user_guide/feature_guide/index.md) to get each feature's configuration.

## Environment Preparation

### Model Weights

- `MiniMax-M2.5` (fp8 checkpoint): recommended to run on **1× Atlas 800 A3** or **2× Atlas 800I A2** nodes. Download the model weights from [MiniMax/MiniMax-M2.5](https://modelscope.cn/models/MiniMax/MiniMax-M2.5).
- `MiniMax-M2.5-w8a8-QuaRot`: Download the model weights from [Eco-Tech/MiniMax-M2.5-w8a8-QuaRot](https://modelscope.cn/models/Eco-Tech/MiniMax-M2.5-w8a8-QuaRot).
- `Eagle3`: Download the model weights from [vllm-ascend/MiniMax-M2.5-eagel-model](https://modelscope.cn/models/vllm-ascend/MiniMax-M2.5-eagel-model-0318).
+- `MiniMax-M2.7` (fp8 checkpoint): recommended to run on **1× Atlas 800 A3** or **2× Atlas 800I A2** nodes. Download the model weights from [MiniMax/MiniMax-M2.7](https://modelscope.cn/models/MiniMax/MiniMax-M2.7).
+- `MiniMax-M2.7-w8a8-QuaRot`: Download the model weights from [Eco-Tech/MiniMax-M2.7-w8a8-QuaRot](https://modelscope.cn/models/Eco-Tech/MiniMax-M2.7-w8a8-QuaRot).

It is recommended to download the model weights to a shared directory, such as `/mnt/sfs_turbo/.cache/`. The current release automatically detects the MiniMax-M2 fp8 checkpoint, disables fp8 quantization kernels on NPU, and loads the weights by dequantizing to bf16. This behavior may be removed once public bf16 weights are available.
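As an illustration, a minimal download sketch using the ModelScope CLI; the target directory below is an assumption and should be adapted to your shared storage layout:

```{code-block} bash
# Install the ModelScope CLI if it is not already available
pip install modelscope

# Download the MiniMax-M2.5 fp8 checkpoint into the shared cache directory (illustrative path)
modelscope download --model MiniMax/MiniMax-M2.5 --local_dir /mnt/sfs_turbo/.cache/MiniMax-M2.5
```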
### Installation

-You can use the official docker image to run `MiniMax-M2.5` directly.
+You can use the official docker image to run `MiniMax-M2.5/M2.7` directly.

Select an image based on your machine type and start the container on your node. See [using docker](../../installation.md#set-up-using-docker).
@@ -112,41 +124,79 @@ docker run -itd -u 0 --ipc=host --privileged \
## Online Inference on Multi-NPU

-### A3 (single node, tp=16)
-
-Below is a recommended startup configuration (default performance profile: full context + Tool Calling + Reasoning).
+Below are recommended startup configurations for `MiniMax-M2.5`. You can reuse them for `MiniMax-M2.7` by simply changing the weight path and served model name, but they may not yet be the best match for `MiniMax-M2.7` if you are trying to reach peak performance.
+
+### A3 (single node)
+
+Below is a recommended startup configuration for short-context conditions such as 3.5k/1.5k on `MiniMax-M2.5`, tuned for good performance. If you want to run long-context cases, follow the `Remarks` below to adjust your config.

Notes:

-- By default, `--max-model-len` is not explicitly set. The server reads the model config (M2.5 uses `196608`) and enables verified performance parameters.
-- If you only care about short-context low latency, you can explicitly set `--max-model-len 32768`.
+- If you only care about short-context low latency, you can explicitly set `--max-model-len 32768`. You may also set `tensor-parallel-size` to 16 and `data-parallel-size` to 1.
+- `export VLLM_ASCEND_BALANCE_SCHEDULING=1` is used to improve scheduling between prefill and decode. It works especially well with a larger `data-parallel-size` and can increase performance as concurrency approaches `data-parallel-size` times `max-num-seqs` (for the configuration below, 4 x 48 = 192 concurrent requests).
+- Running the current Eagle3 weights with `MiniMax-M2.7` yields no performance improvement; it is recommended to remove the `--speculative_config` in that case.

```{code-block} bash
cd /workspace

# HCCL collective-communication tuning
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024

# Allocator, CPU, and OS-level tuning
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_NUM_THREADS=1
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
export TASK_QUEUE_ENABLE=1

# vLLM Ascend feature flags
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export VLLM_ASCEND_BALANCE_SCHEDULING=1
-vllm serve /models/MiniMax-M2.5 \
-  --served-model-name MiniMax-M2.5 \
-  --trust-remote-code \
-  --dtype bfloat16 \
-  --tensor-parallel-size 16 \
-  --enable-expert-parallel \
-  --max-num-seqs 32 \
-  --max-num-batched-tokens 32768 \
-  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
-  --enable-auto-tool-choice \
-  --tool-call-parser minimax_m2 \
-  --reasoning-parser minimax_m2_append_think \
-  --port 8000 \
-  > /tmp/minimax-m25-serve.log 2>&1 &
-
-tail -f /tmp/minimax-m25-serve.log
+# Serve the w8a8 QuaRot checkpoint with tp=4 x dp=4 and Eagle3 speculative decoding
+vllm serve /path/to/weight/MiniMax-M2.5-w8a8-QuaRot \
+  --served-model-name "MiniMax-M2.5" \
+  --host 0.0.0.0 \
+  --port 8000 \
+  --trust-remote-code \
+  --quantization ascend \
+  --async-scheduling \
+  --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+  --additional-config '{"enable_cpu_binding":true}' \
+  --enable-expert-parallel \
+  --tensor-parallel-size 4 \
+  --data-parallel-size 4 \
+  --max-num-seqs 48 \
+  --max-model-len 40690 \
+  --max-num-batched-tokens 16384 \
+  --gpu-memory-utilization 0.85 \
+  --speculative_config '{"enforce_eager": true, "method": "eagle3", "model": "/path/to/weight/Eagle3/", "num_speculative_tokens": 3}'
```
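Once the server is up, a quick sanity check can be run against the OpenAI-compatible API; this sketch assumes the server listens on port 8000 as configured above:

```{code-block} bash
# A JSON listing that contains "MiniMax-M2.5" confirms the model is loaded and being served
curl http://localhost:8000/v1/models
```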
Remarks:

- `minimax_m2_append_think` keeps `<think>...</think>` inside `content`.
- If you mainly rely on the reasoning semantics of `/v1/responses`, it is recommended to use `--reasoning-parser minimax_m2` instead.
+- To get better performance on long contexts such as 128k or 64k, we recommend the changes shown below; you can also remove `export VLLM_ASCEND_BALANCE_SCHEDULING=1`.
+
+```{code-block} bash
+--tensor-parallel-size 8 \
+--data-parallel-size 1 \
+--decode-context-parallel-size 1 \
+--prefill-context-parallel-size 2 \
+--cp-kv-cache-interleave-size 128 \
+--max-num-seqs 16 \
+--max-model-len 138000 \
+--max-num-batched-tokens 65536 \
+--gpu-memory-utilization 0.85 \
+--speculative_config '{"enforce_eager": true, "method": "eagle3", "model": "/path/to/weight/Eagle3/", "num_speculative_tokens": 1}' \
+```
+
+- If you want to test with the `curl` command, you can add the following flags to the startup command above (see the example request after this block).
+
+```{code-block} bash
+--enable-auto-tool-choice \
+--tool-call-parser minimax_m2 \
+--reasoning-parser minimax_m2_append_think \
+```
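With those flags enabled, a minimal chat completion request might look like the following; the prompt and token budget are illustrative:

```{code-block} bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    "max_tokens": 256
  }'
```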
### A2 (dual node, tp=8 + dp=2)

@@ -343,22 +393,21 @@ curl http://{PrimaryNodeIP}:20004/v1/chat/completions \
}'
```
-## Performance Reference
+## Performance Reference (`MiniMax-M2.5`)

### A3 (single node, tp=16, 4k/1k@bs16)

#### Results

-**Baseline** (`4k/1k@bs=16`)
+**Baseline** (`3.5k/1k@bs=217`)

| Metric | Result |
| --- | --- |
-| Success/Failure | `16/0` |
-| Mean TTFT | `616.20 ms` |
-| Mean TPOT | `31.92 ms` |
-| Mean ITL | `31.92 ms` |
-| Output tok/s | `492.39` |
-| Total tok/s | `2461.95` |
+| Success/Failure | `217/0` |
+| Mean TTFT | `10316.56 ms` |
+| Mean TPOT | `34.28 ms` |
+| Output tok/s | `4803.81` |
+| Total tok/s | `16096.59` |
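The document does not spell out the benchmarking harness behind these numbers. Purely as an illustration, a random-dataset run of vLLM's serving benchmark with a similar request shape could look like the sketch below; `vllm bench serve` and its flags are assumptions about your vLLM version, and the tokenizer path is illustrative:

```{code-block} bash
# Drive the server started above with 3.5k-input/1k-output random requests at bs=217
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model MiniMax-M2.5 \
  --tokenizer /path/to/weight/MiniMax-M2.5-w8a8-QuaRot \
  --dataset-name random \
  --random-input-len 3500 \
  --random-output-len 1000 \
  --num-prompts 217 \
  --max-concurrency 217
```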
**Long-context reference** (`190k/1k@bs=4`)
@@ -32,5 +32,5 @@ GLM5.md
Kimi-K2-Thinking.md
Kimi-K2.5.md
PaddleOCR-VL.md
-MiniMax-M2.5.md
+MiniMax-M2.md
:::
@@ -28,6 +28,8 @@ Get the latest info here: <https://github.com/vllm-project/vllm-ascend/issues/16
| GLM-4.x | 🔵 | || A2/A3 |✅|✅|✅||✅|✅|✅||✅|✅|✅|✅|✅|198k||[GLM-4.x](../../tutorials/models/GLM4.x.md)|
| GLM-5 | 🔵 | | ✅ | A2/A3 | ✅ | ✅ | ✅ || ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | 198k || [GLM-5](../../tutorials/models/GLM5.md) |
| Kimi-K2-Thinking | 🔵 | || A2/A3 |||||||||||||||| [Kimi-K2-Thinking](../../tutorials/models/Kimi-K2-Thinking.md) |
+| MiniMax-M2.5 | ✅ | | ✅ | A2/A3 |✅|✅|✅|❌|✅|✅|✅|🟡|✅|✅|✅|✅|✅|192k|🟡| [MiniMax-M2.5](../../tutorials/models/MiniMax-M2.md) |
+| MiniMax-M2.7 | ✅ | | ✅ | A2/A3 |✅|✅|✅|❌|🔵|✅|✅|🟡|✅|✅|🟡|✅|✅|192k|🟡| [MiniMax-M2.7](../../tutorials/models/MiniMax-M2.md) |

#### Extended Compatible Models