[DOC] MiniMax-M2.5 model intro (#7296)
### What this PR does / why we need it?
1. Add nightly test on MiniMax-M2.5 with deployment method on A3
2. Add MiniMax-M2.5 deployment introduction to vllm-ascend docs
- vLLM version: v0.17.0
- vLLM main: 4034c3d32e
---------
Signed-off-by: limuyuan <limuyuan3@huawei.com>
Signed-off-by: SparrowMu <52023119+SparrowMu@users.noreply.github.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
.github/workflows/misc/model_list.json
@@ -239,6 +239,7 @@
     "vllm-ascend/vllm-eagle-llama-68m-random",
     "wemaster/deepseek_mtp_main_random_bf16",
     "wemaster/deepseek_mtp_main_random_w8a8_part",
-    "xlangai/OpenCUA-7B"
+    "xlangai/OpenCUA-7B",
+    "MiniMax/MiniMax-M2.5"
   ]
 }
@@ -261,6 +261,9 @@ jobs:
       - name: kimi-k2-thinking
         os: linux-aarch64-a3-16
         config_file_path: Kimi-K2-Thinking.yaml
+      - name: minimax-m2-5
+        os: linux-aarch64-a3-16
+        config_file_path: MiniMax-M2.5-A3.yaml
       - name: mtpx-deepseek-r1-0528-w8a8
         os: linux-aarch64-a3-16
         config_file_path: MTPX-DeepSeek-R1-0528-W8A8.yaml
docs/source/tutorials/models/MiniMax-M2.5.md (new file)
@@ -0,0 +1,436 @@
# MiniMax-M2.5

## Introduction

MiniMax-M2.5 is MiniMax's flagship large language model, reinforced for high-value scenarios such as code generation, agentic tool calling and search, and complex office workflows, with an emphasis on reasoning efficiency and end-to-end speed on challenging tasks.

This document provides a unified deployment guide for `MiniMax-M2.5` on vLLM Ascend, covering both:

- **A3 single-node** deployment (Atlas 800 A3)
- **A2 dual-node** deployment (2× Atlas 800I A2)

## Environment Preparation

### Model Weights

- `MiniMax-M2.5` (fp8 checkpoint): **1× Atlas 800 A3** or **2× Atlas 800I A2** nodes are recommended. Download the model weights from [MiniMax/MiniMax-M2.5](https://modelscope.cn/models/MiniMax/MiniMax-M2.5).

It is recommended to download the model weights to a shared directory, such as `/mnt/sfs_turbo/.cache/`. The current release automatically detects the MiniMax-M2 fp8 checkpoint, disables the fp8 quantization kernels on NPU, and loads the weights by dequantizing them to bf16. This behavior may be removed once public bf16 weights are available.
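The dequantize-on-load idea can be pictured with a toy numeric sketch (pure Python, made-up values and scale; this is an illustration of blockwise dequantization, not the actual loader code):

```python
# Toy illustration of dequantize-on-load: an fp8-style checkpoint stores
# low-precision values plus a per-block scale factor; loading multiplies
# them back into a higher-precision (bf16-like) tensor.
def dequantize_block(quantized, scale):
    """Recover approximate full-precision values from one quantized block."""
    return [q * scale for q in quantized]

# One "block" of quantized weights and its stored scale factor (made up).
quantized_block = [12, -7, 3, 0]
block_scale = 0.0625

weights = dequantize_block(quantized_block, block_scale)
print(weights)  # [0.75, -0.4375, 0.1875, 0.0]
```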
### Installation

You can use the official docker image to run `MiniMax-M2.5` directly.

Select an image based on your machine type and start the container on your node. See [using docker](../../installation.md#set-up-using-docker).

## Run with Docker

### A3 (single node)
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables
# Note: if you are running docker with a bridge network, expose the ports needed for multi-node communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```
### A2 (dual node, run on both nodes)

Create and run `minimax25-docker-run.sh` on **both** A2 nodes.

Notes:

- The default configuration assumes an **Atlas 800I A2 8-NPU** node and sets `ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. Update it based on your hardware.
- Map your model weight directory into the container (the example maps it to `/opt/data/verification/`).
```{code-block} bash
#!/bin/sh
NAME=minimax2_5
DEVICES="0,1,2,3,4,5,6,7"
IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|

# Map the model weight directory (/opt/data/verification/ here) into the container
docker run -itd -u 0 --ipc=host --privileged \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=$DEVICES \
--name $NAME \
--net=host \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--shm-size=1200g \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/:/home/ \
-v /opt/data/verification/:/opt/data/verification/ \
-v /root/.cache:/root/.cache \
-v /mnt/performance/:/mnt/performance/ \
-it $IMAGE bash

# Start and enter the container
# bash minimax25-docker-run.sh
# docker exec -it minimax2_5 bash
```
## Online Inference on Multi-NPU

### A3 (single node, tp=16)

Below is a recommended startup configuration (default performance profile: full context + tool calling + reasoning).

Notes:

- By default, `--max-model-len` is not set explicitly. The server reads the maximum length from the model config (`196608` for M2.5) and enables the verified performance parameters.
- If you only care about short-context low latency, set `--max-model-len 32768` explicitly.
```{code-block} bash
cd /workspace
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /models/MiniMax-M2.5 \
--served-model-name MiniMax-M2.5 \
--trust-remote-code \
--dtype bfloat16 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--port 8000 \
> /tmp/minimax-m25-serve.log 2>&1 &

tail -f /tmp/minimax-m25-serve.log
```
Remarks:

- `minimax_m2_append_think` keeps `<think>...</think>` inside `content`.
- If you mainly rely on the reasoning semantics of `/v1/responses`, use `--reasoning-parser minimax_m2` instead.

### A2 (dual node, tp=8 + dp=2)

Since cross-node tensor parallelism (TP) can be unstable, the dual-node guide uses a **tp=8 + dp=2** setup (8 NPUs per node, 16 NPUs in total).
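The resulting topology can be sanity-checked with a little arithmetic (an illustrative sketch; the helper name below is not a vLLM API):

```python
# Sketch of the tp=8 x dp=2 layout: each node hosts one data-parallel
# replica, and each replica shards the model across its 8 NPUs with
# tensor parallelism.
def cluster_layout(tp_size, dp_size, npus_per_node):
    replicas_per_node = npus_per_node // tp_size   # one replica per A2 node
    nodes_needed = dp_size // replicas_per_node    # one node per DP rank here
    total_npus = tp_size * dp_size                 # NPUs across the cluster
    return {"nodes": nodes_needed, "total_npus": total_npus,
            "replicas_per_node": replicas_per_node}

print(cluster_layout(tp_size=8, dp_size=2, npus_per_node=8))
# {'nodes': 2, 'total_npus': 16, 'replicas_per_node': 1}
```

This is why node0 uses `--data-parallel-start-rank 0` and node1 uses `--data-parallel-start-rank 1`: each node owns exactly one DP replica.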
#### Node0 (primary) startup script

Edit `minimax25_service_node0.sh` inside the node0 container, and replace the placeholders with your actual values:

- `{PrimaryNodeIP}`: the primary node's IP address (public/cluster network)
- `{NIC}`: the NIC name for the public/cluster network (check via `ifconfig`, e.g., `enp67s0f0np0`)
- `VLLM_TORCH_PROFILER_DIR`: optional, the directory for profiling outputs
```{code-block} bash
# Primary node (node0)
export HCCL_IF_IP={PrimaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"

vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
--served-model-name "minimax25" \
--host {PrimaryNodeIP} \
--port 20004 \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address {PrimaryNodeIP} \
--data-parallel-rpc-port 2347 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}'
```
#### Node1 (secondary) startup script

Edit `minimax25_service_node1.sh` inside the node1 container:

- `{SecondaryNodeIP}`: the secondary node's IP address
- `{PrimaryNodeIP}`: the primary node's IP address (same as node0)
- `{NIC}`: same as above
```{code-block} bash
# Secondary node (node1)
export HCCL_IF_IP={SecondaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"

vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
--served-model-name "minimax25" \
--host {SecondaryNodeIP} \
--port 20004 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address {PrimaryNodeIP} \
--data-parallel-rpc-port 2347 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}'
```
#### Startup order

Start the service on both nodes:

```{code-block} bash
# node0
bash minimax25_service_node0.sh

# node1
bash minimax25_service_node1.sh
```

After node0 prints `service start` in its logs, you can verify the service.
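Waiting for readiness can be automated instead of watching logs. A minimal polling sketch (illustrative; it assumes a probe such as an HTTP GET against vLLM's `/health` endpoint, injected as a callable so the loop is testable):

```python
import time

def wait_for_ready(probe, timeout_s=600, interval_s=5):
    """Poll `probe()` (e.g. an HTTP GET against http://{PrimaryNodeIP}:20004/health)
    until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe():          # True once the server answers successfully
                return True
        except OSError:          # server not accepting connections yet
            pass
        time.sleep(interval_s)
    return False

# Example with a stub probe that "succeeds" on the third attempt:
attempts = iter([False, False, True])
print(wait_for_ready(lambda: next(attempts), timeout_s=10, interval_s=0))  # True
```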
## Verify the Service

### A3 (single node)

Test with an OpenAI-compatible client:

```{code-block} python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="na")

resp = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[{"role": "user", "content": "Hello, please introduce yourself and show an example tool-call argument format."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```
Or send a request using curl:

```{code-block} bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MiniMax-M2.5",
    "messages": [{"role": "user", "content": "Please check the weather in Shanghai."}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get weather by city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
          },
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0,
    "max_tokens": 512
  }'
```
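When the model decides to call the tool, the answer arrives in the OpenAI-compatible `tool_calls` field rather than in `content`. A minimal extraction sketch (the response dict below is a hand-written stand-in illustrating the schema, not captured server output):

```python
import json

def extract_tool_calls(response):
    """Pull (name, parsed-arguments) pairs out of an OpenAI-style chat completion."""
    message = response["choices"][0]["message"]
    return [(tc["function"]["name"], json.loads(tc["function"]["arguments"]))
            for tc in message.get("tool_calls") or []]

# Hand-written stand-in for a response to the weather request above.
mock_response = {
    "choices": [{
        "message": {
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": "{\"city\": \"Shanghai\", \"unit\": \"celsius\"}",
                },
            }],
        }
    }]
}

print(extract_tool_calls(mock_response))
# [('get_current_weather', {'city': 'Shanghai', 'unit': 'celsius'})]
```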
### A2 (dual node)

Run the following from any machine that can reach the primary node (replace `{PrimaryNodeIP}` with the real IP):

```{code-block} bash
curl http://{PrimaryNodeIP}:20004/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax25",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "stream": false,
    "ignore_eos": true,
    "temperature": 0.8,
    "top_p": 0.8,
    "max_tokens": 200
  }'
```
## Performance Reference

### A3 (single node, tp=16, 4k/1k@bs16)

#### Results

**Baseline** (`4k/1k@bs=16`)

| Metric | Result |
| --- | --- |
| Success/Failure | `16/0` |
| Mean TTFT | `616.20 ms` |
| Mean TPOT | `31.92 ms` |
| Mean ITL | `31.92 ms` |
| Output tok/s | `492.39` |
| Total tok/s | `2461.95` |

**Long-context reference** (`190k/1k@bs=4`)

| Metric | Result |
| --- | --- |
| Output tok/s | `37.12` |
| Mean TTFT | `2002.37 ms` |
| Mean TPOT | `105.54 ms` |
| Mean ITL | `105.54 ms` |
### A2 (dual node, 190k/1k, concurrency=4, 16 prompts)

#### Benchmark method

Use vLLM bench for the **190k/1k, concurrency=4, 16 prompts** scenario:

```{code-block} bash
# Input: a 175104-token repeated prefix (~90% of the ~190k-token input)
# plus a 19440-token unique suffix; output: 1024 tokens per request
vllm bench serve --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 175104 \
  --prefix-repetition-suffix-len 19440 \
  --prefix-repetition-output-len 1024 \
  --prefix-repetition-num-prefixes 1 \
  --num-prompts 16 \
  --max-concurrency 4 \
  --ignore-eos \
  --model minimax25 \
  --tokenizer {model_path} \
  --endpoint /v1/completions \
  --request-rate inf \
  --seed 1000 \
  --host {service_ip} \
  --port 20004
```
#### Results

**190k/1k, concurrency=4, 16 prompts**

| Metric | Result |
| --- | --- |
| TTFT (avg) | 3305.25 ms |
| TPOT (avg) | 109.83 ms |
| Output throughput | 35.29 tok/s |
| Prefix hit rate | 85% |
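These numbers are roughly self-consistent: in steady-state decode, each of the 4 concurrent requests emits one token per TPOT, so `concurrency / TPOT` is an upper bound on output throughput, and the measured value sits just below it because TTFT and scheduling gaps consume wall-clock time. A quick check:

```python
# Sanity-check the A2 results: upper bound on output throughput given
# 4 concurrent requests and a mean TPOT of 109.83 ms per token.
concurrency = 4
tpot_s = 0.10983          # mean time-per-output-token, seconds
measured_tok_s = 35.29    # reported output throughput

ceiling_tok_s = concurrency / tpot_s
print(round(ceiling_tok_s, 2))  # 36.42
assert measured_tok_s < ceiling_tok_s
```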
## FAQ

- **Q: What should I do if the output is garbled in EP mode?**

  A: Keep `--enable-expert-parallel` and `VLLM_ASCEND_ENABLE_FLASHCOMM1=1` enabled.

- **Q: Why is the `reasoning` field often empty after using `minimax_m2_append_think`?**

  A: This is expected. The parser keeps `<think>...</think>` inside `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, use `--reasoning-parser minimax_m2` instead.

- **Q: Startup fails with HCCL port conflicts (address already bound). What should I do?**

  A: Clean up old processes and restart: `pkill -f "vllm serve /models/MiniMax-M2.5"`.

- **Q: How do I handle OOM or unstable startup?**

  A: Reduce `--max-num-seqs` and `--max-num-batched-tokens` first. If needed, also reduce concurrency and load-testing pressure (e.g., `max-concurrency` / `num-prompts`).

- **Q: Why not use cross-node tp=16?**

  A: Cross-node TP has been observed to be unstable, so `tp=8, dp=2` is recommended for dual-node deployment.

- **Q: How should I choose `--reasoning-parser`?**

  A: This guide uses `minimax_m2_append_think` so that `<think>...</think>` is kept in `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, consider `--reasoning-parser minimax_m2`.

- **Q: Which ports must be accessible?**

  A: At minimum, expose the serving port (e.g., `20004`) and the data-parallel RPC port (e.g., `2347`), and ensure the two nodes can reach each other over the network.
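Reachability of those ports can be checked from the standard library before digging into HCCL itself (an illustrative sketch; `{PrimaryNodeIP}`, `20004`, and `2347` are the values used in this guide):

```python
import socket

def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# e.g. port_open("{PrimaryNodeIP}", 20004) for the serving port,
#      port_open("{PrimaryNodeIP}", 2347)  for the data-parallel RPC port.
print(port_open("127.0.0.1", 20004))  # True only if a server is listening locally
```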
@@ -32,4 +32,5 @@ GLM5.md
 Kimi-K2-Thinking.md
 Kimi-K2.5.md
 PaddleOCR-VL.md
+MiniMax-M2.5.md
 :::
@@ -0,0 +1,46 @@
# ==========================================
# ACTUAL TEST CASES
# ==========================================

test_cases:
  - name: "MiniMax-M2.5-TP16-Reasoning-Tool"
    model: "MiniMax/MiniMax-M2.5"
    envs:
      HCCL_BUFFSIZE: "1024"
      OMP_PROC_BIND: "false"
      HCCL_OP_EXPANSION_MODE: "AIV"
      PYTORCH_NPU_ALLOC_CONF: "expandable_segments:True"
      VLLM_ASCEND_ENABLE_FLASHCOMM1: "1"
      SERVER_PORT: "DEFAULT_PORT"
    prompts:
      - "Hello. Please introduce yourself briefly."
    api_keyword_args:
      max_tokens: 128
      temperature: 0
    test_content:
      - chat_completion
    server_cmd:
      - "--tensor-parallel-size"
      - "16"
      - "--port"
      - "$SERVER_PORT"
      - "--trust-remote-code"
      - "--dtype"
      - "bfloat16"
      - "--enable-expert-parallel"
      - "--max-num-seqs"
      - "32"
      - "--max-num-batched-tokens"
      - "32768"
      # Prefer a smaller max length for nightly stability. For full context,
      # omit this flag and rely on the model config (196608).
      - "--max-model-len"
      - "32768"
      - "--compilation-config"
      - '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "minimax_m2"
      - "--reasoning-parser"
      - "minimax_m2_append_think"