### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `SamplingParams` argument `max_completion_tokens` with `max_tokens` (**offline** inference currently **does not support** `max_completion_tokens`)
``` bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
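For reference, a minimal corrected offline snippet (the model path is illustrative):
``` python
# Offline inference must use max_tokens; max_completion_tokens is not accepted.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/your/model")  # illustrative local checkpoint path
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```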
- refresh the recommended environment variables for **Qwen3-Omni-30B-A3B-Thinking**
``` bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
``` bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
- fix **Qwen3-reranker** example usage to match the current **pooling
runner** interface and score output access
``` python
model = LLM(
model=model_name,
task="score", # need fix
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
--->
``` python
model = LLM(
model=model_name,
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],
```
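For context, a minimal end-to-end sketch of the updated usage (the model id and query/document strings below are illustrative; scores are read via `output.outputs.score`):
``` python
# Sketch: scoring with the pooling runner; model id and texts are illustrative.
from vllm import LLM

model = LLM(
    model="Qwen/Qwen3-Reranker-0.6B",  # illustrative model id
    runner="pooling",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
    },
)
queries = ["What is the capital of France?"]
documents = ["Paris is the capital and largest city of France."]
outputs = model.score(queries, documents)
print([output.outputs.score for output in outputs])
```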
- change the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`
``` bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
These changes are needed because several documentation examples had
drifted from the current runtime behavior and recommended invocation
patterns, which could confuse users when following the tutorials
directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main:
4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
# MiniMax-M2.5
## Introduction
MiniMax-M2.5 is MiniMax's flagship large language model, reinforced for high-value scenarios such as code generation, agentic tool calling/search, and complex office workflows, with an emphasis on reasoning efficiency and end-to-end speed on challenging tasks.
This document provides a unified deployment guide for MiniMax-M2.5 on vLLM Ascend, covering both:
- A3 single-node deployment (Atlas 800 A3)
- A2 dual-node deployment (2× Atlas 800I A2)
## Environment Preparation
### Model Weights
MiniMax-M2.5 (fp8 checkpoint): 1× Atlas 800 A3 node or 2× Atlas 800I A2 nodes are recommended. Download the model weights from MiniMax/MiniMax-M2.5.
It is recommended to download the model weights to a shared directory, such as `/mnt/sfs_turbo/.cache/`. The current release automatically detects the MiniMax-M2 fp8 checkpoint, disables fp8 quantization kernels on NPU, and loads the weights by dequantizing to bf16. This behavior may be removed once public bf16 weights are available.
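For example, a download sketch using the ModelScope CLI (the target path is illustrative; adjust it to your shared directory):
```bash
# Download the MiniMax-M2.5 weights into the shared cache directory (path illustrative)
pip install -U modelscope
modelscope download --model MiniMax/MiniMax-M2.5 --local_dir /mnt/sfs_turbo/.cache/MiniMax-M2.5
```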
### Installation
You can use the official docker image to run MiniMax-M2.5 directly.
Select an image based on your machine type and start the container on your node. See using docker.
## Run with Docker
### A3 (single node)
```{code-block} bash
:substitutions:
# Update the vllm-ascend image
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/home/cache \
-it $IMAGE bash
```
### A2 (dual node, run on both nodes)
Create and run `minimax25-docker-run.sh` on both A2 nodes.
Notes:
- The default configuration assumes an Atlas 800I A2 8-NPU node and sets `ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. Update it based on your hardware.
- Map your model weight directory into the container (the example maps it to `/opt/data/verification/`).
```{code-block} bash
:substitutions:
#!/bin/sh
NAME=minimax2_5
DEVICES="0,1,2,3,4,5,6,7"
IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run -itd -u 0 --ipc=host --privileged \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=$DEVICES \
--name $NAME \
--net=host \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--shm-size=1200g \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/:/home/ \
-v /opt/data/verification/:/opt/data/verification/ \
-v /root/.cache:/root/.cache \
-v /mnt/performance/:/mnt/performance/ \
-it $IMAGE bash
# Start and enter the container
# bash minimax25-docker-run.sh
# docker exec -it minimax2_5 bash
```
## Online Inference on Multi-NPU
### A3 (single node, tp=16)
Below is a recommended startup configuration (default performance profile: full context + Tool Calling + Reasoning).
Notes:
- By default, `--max-model-len` is not explicitly set. The server reads the model config (M2.5 uses `196608`) and enables verified performance parameters.
- If you only care about short-context low latency, you can explicitly set `--max-model-len 32768`.
```bash
cd /workspace
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /models/MiniMax-M2.5 \
--served-model-name MiniMax-M2.5 \
--trust-remote-code \
--dtype bfloat16 \
--tensor-parallel-size 16 \
--enable-expert-parallel \
--max-num-seqs 32 \
--max-num-batched-tokens 32768 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--port 8000 \
> /tmp/minimax-m25-serve.log 2>&1 &
tail -f /tmp/minimax-m25-serve.log
```
Remarks:
- `minimax_m2_append_think` keeps `<think>...</think>` inside `content`.
- If you mainly rely on the reasoning semantics of `/v1/responses`, it is recommended to use `--reasoning-parser minimax_m2` instead.
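A small client-side sketch of the difference (it assumes the server started above; a separate reasoning field is only present when the parser splits reasoning out of `content`):
```python
# Where reasoning ends up, depending on the chosen --reasoning-parser.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="na")
resp = client.chat.completions.create(
    model="MiniMax-M2.5",
    messages=[{"role": "user", "content": "Briefly explain expert parallelism."}],
    max_tokens=128,
)
msg = resp.choices[0].message
# With minimax_m2_append_think: <think>...</think> stays inside msg.content.
print(msg.content)
# With minimax_m2: reasoning is split out of content and, when present,
# exposed separately (e.g., as a reasoning_content field on the message).
print(getattr(msg, "reasoning_content", None))
```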
### A2 (dual node, tp=8 + dp=2)
Since cross-node tensor parallelism (TP) can be unstable, the dual-node guide uses a tp=8 + dp=2 setup (8 NPUs per node, 16 NPUs total).
#### Node0 (primary) startup script
Edit `minimax25_service_node0.sh` inside the node0 container, and replace the placeholders with your actual values:
- `{PrimaryNodeIP}`: the primary node's IP address (public/cluster network)
- `{NIC}`: the NIC name for the public/cluster network (check via `ifconfig`, e.g., `enp67s0f0np0`)
- `VLLM_TORCH_PROFILER_DIR`: optional, directory to store profiling outputs
```bash
# Primary node (node0)
export HCCL_IF_IP={PrimaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"
vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
--served-model-name "minimax25" \
--host {PrimaryNodeIP} \
--port 20004 \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address {PrimaryNodeIP} \
--data-parallel-rpc-port 2347 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}'
```
#### Node1 (secondary) startup script
Edit `minimax25_service_node1.sh` inside the node1 container:
- `{SecondaryNodeIP}`: the secondary node's IP address
- `{PrimaryNodeIP}`: the primary node's IP address (same as node0)
- `{NIC}`: same as above
```bash
# Secondary node (node1)
export HCCL_IF_IP={SecondaryNodeIP}
export GLOO_SOCKET_IFNAME="{NIC}"
export TP_SOCKET_IFNAME="{NIC}"
export HCCL_SOCKET_IFNAME="{NIC}"
export HCCL_BUFFSIZE=1024
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export HCCL_OP_EXPANSION_MODE="AIV"
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
# profiling (optional)
export VLLM_TORCH_PROFILER_WITH_STACK=0
export VLLM_TORCH_PROFILER_DIR="{profiling_dir}"
vllm serve /opt/data/verification/models/MiniMax-M2.5/ \
--served-model-name "minimax25" \
--host {SecondaryNodeIP} \
--port 20004 \
--headless \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address {PrimaryNodeIP} \
--data-parallel-rpc-port 2347 \
--max-num-seqs 128 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.92 \
--enable-expert-parallel \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--mm_processor_cache_type="shm" \
--async-scheduling \
--additional-config '{"enable_cpu_binding":true}'
```
#### Startup order
Start the service on both nodes:
```bash
# node0
bash minimax25_service_node0.sh
# node1
bash minimax25_service_node1.sh
```
After node0 prints the service-start message in its logs, you can verify the service.
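For a quick readiness check before sending real requests, you can list the served models on the primary node (replace `{PrimaryNodeIP}` with the real IP):
```bash
# Should return the served model name (e.g., "minimax25") once the service is up
curl http://{PrimaryNodeIP}:20004/v1/models
```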
## Verify the Service
### A3 (single node)
Test with an OpenAI-compatible client:
```python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="na")
resp = client.chat.completions.create(
model="MiniMax-M2.5",
messages=[{"role": "user", "content": "你好,请介绍一下你自己,并展示一次工具调用的参数格式。"}],
max_tokens=256,
)
print(resp.choices[0].message.content)
```
Or send a request using curl:
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M2.5",
"messages": [{"role": "user", "content": "请查询上海的天气。"}],
"tools": [{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get weather by city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"temperature": 0,
"max_tokens": 512
}'
```
### A2 (dual node)
Run the following from any machine that can reach the primary node (replace `{PrimaryNodeIP}` with the real IP):
```bash
curl http://{PrimaryNodeIP}:20004/v1/chat/completions \
-H "Content-type: application/json" \
-d '{
"model": "minimax25",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"stream": false,
"ignore_eos": true,
"temperature": 0.8,
"top_p": 0.8,
"max_tokens": 200
}'
```
## Performance Reference
### A3 (single node, tp=16, 4k/1k@bs16)
#### Results
Baseline (4k/1k@bs=16)
| Metric | Result |
|---|---|
| Success/Failure | 16/0 |
| Mean TTFT | 616.20 ms |
| Mean TPOT | 31.92 ms |
| Mean ITL | 31.92 ms |
| Output tok/s | 492.39 |
| Total tok/s | 2461.95 |
Long-context reference (190k/1k@bs=4)
| Metric | Result |
|---|---|
| Output tok/s | 37.12 |
| Mean TTFT | 2002.37 ms |
| Mean TPOT | 105.54 ms |
| Mean ITL | 105.54 ms |
### A2 (dual node, 190k/1k, concurrency=4, 16 prompts)
#### Benchmark method
Use vLLM bench for the 190k/1k, concurrency=4, 16 prompts scenario:
```bash
# Input: 190×1024 tokens with 90% prefix repetition
# (prefix length 175104, suffix length 19440); output: 1024 tokens per request
vllm bench serve --backend vllm \
--dataset-name prefix_repetition \
--prefix-repetition-prefix-len 175104 \
--prefix-repetition-suffix-len 19440 \
--prefix-repetition-output-len 1024 \
--prefix-repetition-num-prefixes 1 \
--num-prompts 16 \
--max-concurrency 4 \
--ignore-eos \
--model minimax25 \
--tokenizer {model_path} \
--endpoint /v1/completions \
--request-rate inf \
--seed 1000 \
--host {service_ip} \
--port 20004
```
#### Results
190k/1k, concurrency=4, 16 prompts
| Metric | Result |
|---|---|
| TTFT (avg) | 3305.25 ms |
| TPOT (avg) | 109.83 ms |
| Output throughput | 35.29 tok/s |
| Prefix hit rate | 85% |
## FAQ

- **Q: What should I do if the output is garbled in EP mode?**
  A: It is recommended to keep `--enable-expert-parallel` and `VLLM_ASCEND_ENABLE_FLASHCOMM1=1`.

- **Q: Why is the `reasoning` field often empty after using `minimax_m2_append_think`?**
  A: This is expected. The parser keeps `<think>...</think>` inside `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, use `--reasoning-parser minimax_m2` instead.

- **Q: Startup fails with HCCL port conflicts (address already bound). What should I do?**
  A: Clean up old processes and restart: `pkill -f "vllm serve /models/MiniMax-M2.5"`.

- **Q: How do I handle OOM or unstable startup?**
  A: Reduce `--max-num-seqs` and `--max-num-batched-tokens` first. If needed, reduce concurrency and load-testing pressure (e.g., `max-concurrency` / `num-prompts`).

- **Q: Why not use cross-node tp=16?**
  A: The referenced practice noted that cross-node TP can be unstable, so `tp=8, dp=2` is recommended for dual-node deployment.

- **Q: How should I choose `--reasoning-parser`?**
  A: This guide uses `minimax_m2_append_think` so that `<think>...</think>` is kept in `content`. If you mainly rely on the reasoning semantics of `/v1/responses`, consider using `--reasoning-parser minimax_m2`.

- **Q: Which ports must be accessible?**
  A: At minimum, expose the serving port (e.g., `20004`) and the data-parallel RPC port (e.g., `2347`), and ensure the two nodes can reach each other over the network.
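A quick way to check this (a sketch; adjust the IP and ports to your deployment):
```bash
# On node0: confirm the serving port and the data-parallel RPC port are listening
ss -ltn | grep -E '20004|2347'
# From node1: confirm node0's serving port is reachable over the network
curl -sS http://{PrimaryNodeIP}:20004/health
```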