### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs
- vLLM version: v0.17.0
- vLLM main: 4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>

# MiniMax-M2.5

## Introduction

MiniMax-M2.5 is MiniMax's flagship large language model, reinforced for high-value scenarios such as code generation, agentic tool calling/search, and complex office workflows, with an emphasis on reasoning efficiency and end-to-end speed on challenging tasks.

This document provides a unified deployment guide for `MiniMax-M2.5` on vLLM Ascend, covering both:

- **A3 single-node** deployment (Atlas 800 A3)
- **A2 dual-node** deployment (2× Atlas 800I A2)

## Environment Preparation

### Model Weights

- `MiniMax-M2.5` (fp8 checkpoint): **1× Atlas 800 A3** node or **2× Atlas 800I A2** nodes are recommended. Download the model weights from [MiniMax/MiniMax-M2.5](https://modelscope.cn/models/MiniMax/MiniMax-M2.5).

We recommend downloading the model weights to a shared directory, such as `/mnt/sfs_turbo/.cache/`. The current release automatically detects the MiniMax-M2 fp8 checkpoint, disables fp8 quantization kernels on NPU, and loads the weights by dequantizing them to bf16. This behavior may be removed once public bf16 weights are available.

### Installation

You can use the official docker image to run `MiniMax-M2.5` directly.

Select an image based on your machine type and start the container on your node. See [using docker](../../installation.md#set-up-using-docker).

## Run with Docker

### A3 (single node)

```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: If you use Docker bridge networking, expose the ports required for multi-node communication in advance
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci8 \
    --device /dev/davinci9 \
    --device /dev/davinci10 \
    --device /dev/davinci11 \
    --device /dev/davinci12 \
    --device /dev/davinci13 \
    --device /dev/davinci14 \
    --device /dev/davinci15 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/sfs_turbo/.cache:/home/cache \
    -it $IMAGE bash
```
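
The sixteen `--device /dev/davinciN` flags in the command above follow a regular pattern, so they can be generated instead of typed by hand. The following is a minimal Python sketch, assuming the NPUs are numbered contiguously from 0; `davinci_device_flags` is a hypothetical helper name and is not part of vllm-ascend.

```python
def davinci_device_flags(num_npus: int) -> list[str]:
    """Return the `--device /dev/davinciN` flags for NPUs 0..num_npus-1."""
    if num_npus < 1:
        raise ValueError("num_npus must be at least 1")
    return [f"--device /dev/davinci{i}" for i in range(num_npus)]


if __name__ == "__main__":
    # Print one flag per line with a trailing backslash,
    # ready to paste into a multi-line `docker run` command.
    for flag in davinci_device_flags(16):
        print(f"    {flag} \\")
```

Passing `8` instead of `16` yields the flag list for a node with eight NPUs.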

### A2 (dual node, run on both nodes)

Create and run `minimax25-docker-run.sh` on **both** A2 nodes.

Notes:

- The default configuration assumes an **Atlas 800I A2 8-NPU** node and sets `ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. Update it based on your hardware.
- Map your model weight directory into the container (the example maps it to `/opt/data/verification/`).

```{code-block} bash
:substitutions:
#!/bin/sh
NAME=minimax2_5
DEVICES="0,1,2,3,4,5,6,7"
IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|

# The /opt/data/verification/ volume below is where the model weights are mapped.
# (A comment cannot follow a trailing backslash, so it is kept outside the command.)
docker run -itd -u 0 --ipc=host --privileged \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -e ASCEND_RT_VISIBLE_DEVICES=$DEVICES \
    --name $NAME \
    --net=host \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    --shm-size=1200g \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /home/:/home/ \
    -v /opt/data/verification/:/opt/data/verification/ \
    -v /root/.cache:/root/.cache \
    -v /mnt/performance/:/mnt/performance/ \
    -it $IMAGE bash

# Start and enter the container
# bash minimax25-docker-run.sh
# docker exec -it minimax2_5 bash
```

## Online Inference on Multi-NPU

### A3 (single node, tp=16)

Below is a recommended startup configuration (default performance profile: full context + tool calling + reasoning).

Notes:

- By default, `--max-model-len` is not explicitly set. The server reads the model config (M2.5 uses `196608`) and enables verified performance parameters.
- If you only care about short-context low latency, you can explicitly set `--max-model-len 32768`.
- Replace `/models/MiniMax-M2.5` with the path of the model weights inside your container.

```{code-block} bash
cd /workspace
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /models/MiniMax-M2.5 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --dtype bfloat16 \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --max-num-seqs 32 \
    --max-num-batched-tokens 32768 \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --port 8000 \
    > /tmp/minimax-m25-serve.log 2>&1 &

tail -f /tmp/minimax-m25-serve.log
```

Remarks:

- `minimax_m2_append_think` keeps `<think>...</think>` inside `content`.
- If you mainly rely on the reasoning semantics of `/v1/responses`, it is recommended to use `--reasoning-parser minimax_m2` instead.
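
Because `minimax_m2_append_think` leaves the reasoning text inside `content`, a client that wants the final answer alone has to separate the two itself. The sketch below shows one way to do that on the client side. It assumes the reasoning appears once at the start of `content` and ends with `</think>`; `split_think` is a hypothetical helper, not a vLLM API.

```python
import re

# Matches an optional opening <think>, the reasoning text, and the closing </think>.
_THINK_RE = re.compile(r"^\s*(?:<think>)?(.*?)</think>", re.DOTALL)


def split_think(content: str) -> tuple[str, str]:
    """Split `content` into (reasoning, answer).

    Returns ("", content) unchanged when no closing </think> tag is found.
    """
    match = _THINK_RE.match(content)
    if match is None:
        return "", content
    reasoning = match.group(1).strip()
    answer = content[match.end():].lstrip()
    return reasoning, answer


if __name__ == "__main__":
    reasoning, answer = split_think("<think>plan the reply</think>\nHello!")
    print(repr(reasoning), repr(answer))
```

With `--reasoning-parser minimax_m2` this post-processing should not be needed, since that parser is described above as the one to use when you rely on the separated reasoning semantics.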

### A2 (dual node, tp=8 + dp=2)

Since cross-node tensor parallelism (TP) can be unstable, the dual-node guide uses a **tp=8 + dp=2** setup (8 NPUs per node, 16 NPUs total).

#### Node0 (primary) startup script

Edit `minimax25_service_node0.sh` inside the node0 container, and replace the placeholders with your actual values:

- `{PrimaryNodeIP}`: the primary node's IP address (public/cluster network)
- `{NIC}`: the NIC name for the public/cluster network (check via `ifconfig`, e.g., `enp67s0f0np0`)
- `VLLM_TORCH_PROFILER_DIR`: optional, directory to store profiling outputs

```{code-block} bash
# Primary node (node0)
export HCCL_IF_IP={PrimaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"

vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
    --served-model-name "minimax25" \
    --host {PrimaryNodeIP} \
    --port 20004 \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 0 \
    --data-parallel-address {PrimaryNodeIP} \
    --data-parallel-rpc-port 2347 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 65536 \
    --gpu-memory-utilization 0.92 \
    --enable-expert-parallel \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --mm_processor_cache_type="shm" \
    --async-scheduling \
    --additional-config '{"enable_cpu_binding":true}'
```

#### Node1 (secondary) startup script

Edit `minimax25_service_node1.sh` inside the node1 container:

- `{SecondaryNodeIP}`: the secondary node's IP address
- `{PrimaryNodeIP}`: the primary node's IP address (same as node0)
- `{NIC}`: same as above

```{code-block} bash
# Secondary node (node1)
export HCCL_IF_IP={SecondaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"

vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
    --served-model-name "minimax25" \
    --host {SecondaryNodeIP} \
    --port 20004 \
    --headless \
    --tensor-parallel-size 8 \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --data-parallel-address {PrimaryNodeIP} \
    --data-parallel-rpc-port 2347 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 65536 \
    --gpu-memory-utilization 0.92 \
    --enable-expert-parallel \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser minimax_m2 \
    --reasoning-parser minimax_m2_append_think \
    --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
    --mm_processor_cache_type="shm" \
    --async-scheduling \
    --additional-config '{"enable_cpu_binding":true}'
```
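
The two scripts differ only in the node-specific values: `HCCL_IF_IP`, `--host`, `--headless`, and `--data-parallel-start-rank`. The sketch below derives the rank-related values from the node index. It assumes every node hosts the same number of data-parallel ranks and that ranks are assigned contiguously by node index, which matches the values in the two scripts; `dp_layout` is a hypothetical helper, not a vLLM function.

```python
def dp_layout(node_index: int, tp_size: int = 8, dp_size: int = 2,
              dp_size_local: int = 1) -> dict:
    """Derive per-node data-parallel settings for an evenly split deployment."""
    num_nodes, remainder = divmod(dp_size, dp_size_local)
    if remainder:
        raise ValueError("dp_size must be a multiple of dp_size_local")
    if not 0 <= node_index < num_nodes:
        raise ValueError(f"node_index must be in [0, {num_nodes})")
    return {
        # Value for --data-parallel-start-rank on this node.
        "data_parallel_start_rank": node_index * dp_size_local,
        # In the scripts above, only the secondary node passes --headless.
        "headless": node_index != 0,
        # NPUs used on this node and across the whole deployment.
        "npus_per_node": tp_size * dp_size_local,
        "total_npus": tp_size * dp_size,
    }


if __name__ == "__main__":
    for node in (0, 1):
        print(node, dp_layout(node))
```

With the defaults, node 0 gets start rank 0 and node 1 gets start rank 1, using 8 NPUs per node and 16 in total.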

#### Startup order

Start the service on both nodes:

```{code-block} bash
# node0
bash minimax25_service_node0.sh

# node1
bash minimax25_service_node1.sh
```

After node0 prints `service start` in logs, you can verify the service.
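
Instead of watching the log by hand, readiness can also be polled over HTTP. The following is a minimal sketch, assuming the server exposes the `/health` route of vLLM's OpenAI-compatible server and returns HTTP 200 once it is ready. `wait_until_ready` and its `opener` parameter are hypothetical names introduced here so the function can be tested without a live server.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0,
                     opener=urllib.request.urlopen) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting requests yet; retry after a pause.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A typical call would be `wait_until_ready("http://{PrimaryNodeIP}:20004/health")`, with the placeholder replaced by the primary node's IP address.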

## Verify the Service

### A3 (single node)

Test with an OpenAI-compatible client:

```{code-block} python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="na")

resp = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello, please introduce yourself and show the parameter format of a tool call."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

Or send a request using curl:

```{code-block} bash
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MiniMax-M2.5",
        "messages": [{"role": "user", "content": "Please look up the weather in Shanghai."}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get weather by city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }],
        "tool_choice": "auto",
        "temperature": 0,
        "max_tokens": 512
    }'
```
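
When the tool-calling request above succeeds, the interesting part of the response is the `tool_calls` list rather than `content`. The sketch below pulls the function name and arguments out of a response in the OpenAI chat-completions format. The `sample` payload is hand-written for illustration and was not captured from a real server, and `extract_tool_calls` is a hypothetical helper.

```python
import json


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function_name, arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # `arguments` arrives as a JSON-encoded string, not a nested object.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written sample payload (illustrative only).
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"city\": \"Shanghai\"}",
                },
            }],
        }
    }]
}

if __name__ == "__main__":
    print(extract_tool_calls(sample))
```

An empty list means the model answered in plain text instead of requesting a tool.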

### A2 (dual node)

Run the following from any machine that can reach the primary node (replace `{PrimaryNodeIP}` with the real IP):

```{code-block} bash
curl http://{PrimaryNodeIP}:20004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "minimax25",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "stream": false,
        "ignore_eos": true,
        "temperature": 0.8,
        "top_p": 0.8,
        "max_tokens": 200
    }'
```

## Performance Reference

### A3 (single node, tp=16, 4k/1k@bs16)

#### Results

**Baseline** (`4k/1k@bs=16`)

| Metric | Result |
| --- | --- |
| Success/Failure | `16/0` |
| Mean TTFT | `616.20 ms` |
| Mean TPOT | `31.92 ms` |
| Mean ITL | `31.92 ms` |
| Output tok/s | `492.39` |
| Total tok/s | `2461.95` |

**Long-context reference** (`190k/1k@bs=4`)

| Metric | Result |
| --- | --- |
| Output tok/s | `37.12` |
| Mean TTFT | `2002.37 ms` |
| Mean TPOT | `105.54 ms` |
| Mean ITL | `105.54 ms` |
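
The figures in the two tables can be cross-checked with simple arithmetic. If every request decoded continuously, output throughput would be at most `concurrency × 1000 / TPOT`, and the measured value should sit slightly below that bound because of prefill time. The sketch below applies this check, taking the batch sizes (16 and 4) as the concurrency. `decode_throughput_bound` is a hypothetical helper, and the bound is a rough estimate, not how the benchmark itself computes throughput.

```python
def decode_throughput_bound(tpot_ms: float, concurrency: int) -> float:
    """Upper bound on output tok/s if all requests decoded without pause."""
    return concurrency * 1000.0 / tpot_ms


if __name__ == "__main__":
    # (label, mean TPOT in ms, concurrency, measured output tok/s) from the tables above.
    rows = [
        ("4k/1k@bs=16", 31.92, 16, 492.39),
        ("190k/1k@bs=4", 105.54, 4, 37.12),
    ]
    for label, tpot_ms, concurrency, measured in rows:
        bound = decode_throughput_bound(tpot_ms, concurrency)
        print(f"{label}: bound {bound:.2f} tok/s, measured {measured:.2f} tok/s")

    # Total tok/s counts input and output tokens: (4096 + 1024) / 1024 = 5x output.
    print(f"total/output ratio: {2461.95 / 492.39:.2f}")
```

For the baseline this gives a bound of roughly 501 tok/s against the measured 492.39, and for the long-context run roughly 37.9 tok/s against 37.12, so both tables are internally consistent.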

### A2 (dual node, 190k/1k, concurrency=4, 16 prompts)

#### Benchmark method

Use vLLM bench for the **190k/1k, concurrency=4, 16 prompts** scenario:

```{code-block} bash
# Input: about 190×1024 tokens, of which the prefix (175104 tokens, 90%) is repeated.
# The suffix length covers the remainder of the input; the output is 1024 tokens.
# (Comments cannot follow a trailing backslash, so they are kept above the command.)
vllm bench serve --backend vllm \
    --dataset-name prefix_repetition \
    --prefix-repetition-prefix-len 175104 \
    --prefix-repetition-suffix-len 19440 \
    --prefix-repetition-output-len 1024 \
    --prefix-repetition-num-prefixes 1 \
    --num-prompts 16 \
    --max-concurrency 4 \
    --ignore-eos \
    --model minimax25 \
    --tokenizer {model_path} \
    --endpoint /v1/completions \
    --request-rate inf \
    --seed 1000 \
    --host {service_ip} \
    --port 20004
```
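
The prefix and suffix lengths in the command come from splitting a 190k-token input at a 90% repetition ratio. The sketch below reproduces that arithmetic; `prefix_suffix_lengths` is a hypothetical helper. Note that an exact split gives a suffix of 19456 tokens, while the command above uses 19440. The source does not explain the 16-token difference, so treat the value in the command as the one that was actually benchmarked.

```python
def prefix_suffix_lengths(input_k: int, prefix_percent: int) -> tuple[int, int]:
    """Split an input of `input_k` × 1024 tokens into (prefix_len, suffix_len)."""
    total = input_k * 1024
    prefix = total * prefix_percent // 100
    return prefix, total - prefix


if __name__ == "__main__":
    prefix, suffix = prefix_suffix_lengths(190, 90)
    print(f"prefix={prefix} suffix={suffix} total={prefix + suffix}")
```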

#### Results

**190k/1k, concurrency=4, 16 prompts**

| Metric | Result |
| --- | --- |
| TTFT (avg) | 3305.25 ms |
| TPOT (avg) | 109.83 ms |
| Output throughput | 35.29 tok/s |
| Prefix hit rate | 85% |

## FAQ

- **Q: What should I do if the output is garbled in EP mode?**

  A: Keep both `--enable-expert-parallel` and `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` enabled.

- **Q: Why is the `reasoning` field often empty after using `minimax_m2_append_think`?**

  A: This is expected. The parser keeps `<think>...</think>` inside `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, use `--reasoning-parser minimax_m2` instead.

- **Q: Startup fails with HCCL port conflicts (address already bound). What should I do?**

  A: Clean up old processes and restart: `pkill -f "vllm serve /models/MiniMax-M2.5"`. Adjust the pattern if your model path differs, as in the A2 scripts.

- **Q: How to handle OOM or unstable startup?**

  A: Reduce `--max-num-seqs` and `--max-num-batched-tokens` first. If needed, reduce concurrency and load-testing pressure (e.g., `max-concurrency` / `num-prompts`).

- **Q: Why not use cross-node tp=16?**

  A: The referenced practice noted that cross-node TP may be unstable, so `tp=8, dp=2` is recommended for dual-node deployment.

- **Q: How should I choose `--reasoning-parser`?**

  A: This guide uses `minimax_m2_append_think` so that `<think>...</think>` is kept in `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, consider using `--reasoning-parser minimax_m2`.

- **Q: Which ports must be accessible?**

  A: At minimum, expose the serving port (e.g., `20004`) and the data-parallel RPC port (e.g., `2347`), and ensure the two nodes can reach each other over the network.
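
For the port question, a quick TCP check run from the other node can confirm that a port is reachable before you debug anything deeper. This is a minimal sketch using only the Python standard library. `port_is_reachable` is a hypothetical helper, and it only tests that something is listening on the port, so the service must already be running on the target node.

```python
import socket


def port_is_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling `port_is_reachable(primary_ip, 20004)` and `port_is_reachable(primary_ip, 2347)` from the secondary node, with `primary_ip` set to the primary node's address, checks the two ports named in the answer.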