### What this PR does / why we need it?
Main updates include:
- update model IDs and default model paths in serving / offline
inference examples
- adjust some command snippets and notes for better copy-paste usability
- replace the `SamplingParams` argument `max_completion_tokens` with `max_tokens` (**offline** inference currently **does not support** `max_completion_tokens`):
```bash
Traceback (most recent call last):
File "/vllm-workspace/vllm-ascend/qwen-next.py", line 18, in <module>
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_completion_tokens=32)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Unexpected keyword argument 'max_completion_tokens'
[ERROR] 2026-03-17-09:57:40 (PID:276, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
```
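
The corrected offline call uses `max_tokens` instead:

```python
from vllm import SamplingParams

# max_completion_tokens is only accepted by the online Chat Completions API;
# offline SamplingParams takes max_tokens.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
```
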
- refresh the recommended environment variables for **Qwen3-Omni-30B-A3B-Thinking**:
```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
```bash
EZ9999[PID: 25038] 2026-03-17-08:21:12.001.372 (EZ9999): HCCL_BUFFSIZE is too SMALL, maxBs = 256, h = 2048,
epWorldSize = 2, localMoeExpertNum = 64, sharedExpertNum = 0, tokenNeedSizeDispatch = 4608, tokenNeedSizeCombine
= 4096, k = 8, NEEDED_HCCL_BUFFSIZE(((maxBs * tokenNeedSizeDispatch * ep_worldsize * localMoeExpertNum) +
(maxBs * tokenNeedSizeCombine * (k + sharedExpertNum))) * 2) = 305MB, HCCL_BUFFSIZE=200MB.
[FUNC:CheckWinSize][FILE:moe_distribute_dispatch_v2_tiling.cpp][LINE:984]
```
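
For reference, plugging the logged values into the formula quoted in the error message confirms that the default 200 MB buffer is too small (a quick sanity check, not code from the docs):

```python
# Values reported in the EZ9999 log above
max_bs, ep_world_size, k = 256, 2, 8
local_moe_expert_num, shared_expert_num = 64, 0
token_dispatch, token_combine = 4608, 4096

needed = ((max_bs * token_dispatch * ep_world_size * local_moe_expert_num)
          + (max_bs * token_combine * (k + shared_expert_num))) * 2
print(f"{needed / 2**20:.0f} MB")  # ~304 MB > 200 MB default, so HCCL_BUFFSIZE=512 is safe
```
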
- fix the **Qwen3-reranker** example to match the current **pooling runner** interface and score output access:
```python
model = LLM(
    model=model_name,
    task="score",  # needs fixing
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
--->
```python
model = LLM(
    model=model_name,
    runner="pooling",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
```
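
Scores can then be read from the pooling outputs; a minimal sketch based on vLLM's `LLM.score` API (the query/document texts are made up for illustration, and field names may vary across versions):

```python
# Hypothetical query/document pair, for illustration only.
query = "What is the capital of China?"
documents = ["The capital of China is Beijing.", "Gravity makes apples fall."]

outputs = model.score(query, documents)
for output in outputs:
    print(output.outputs.score)  # one relevance score per pair
```
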
- change the **PaddleOCR-VL** environment variable `TASK_QUEUE_ENABLE` from `2` to `1`:
```bash
(EngineCore_DP0 pid=26273) RuntimeError: NPUModelRunner init failed, error is NPUModelRunner failed, error
is Do not support TASK_QUEUE_ENABLE = 2 during NPU graph capture, please export TASK_QUEUE_ENABLE=1/0.
```
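
The fix follows the error hint:

```bash
# NPU graph capture only supports TASK_QUEUE_ENABLE of 0 or 1
export TASK_QUEUE_ENABLE=1
```
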
These changes are needed because several documentation examples had drifted from the current runtime behavior and recommended invocation patterns, which could confuse users following the tutorials directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 4497431df6
Signed-off-by: MrZ20 <2609716663@qq.com>
# Qwen3-Next
## Introduction
The Qwen3-Next model is a highly sparse MoE (Mixture of Experts) model. Compared to the MoE architecture of Qwen3, it introduces key improvements such as a hybrid attention mechanism and a multi-token prediction mechanism, improving training and inference efficiency under long contexts and large total parameter scales.
This document presents the core verification steps for the model, including supported features, environment preparation, and accuracy and performance evaluation. Qwen3-Next currently relies on Triton Ascend, which is in the experimental phase; its stability and accuracy may change in subsequent versions, and performance will be continuously optimized.
The `Qwen3-Next` model was first supported in `vllm-ascend:v0.10.2rc1`.
## Supported Features
Refer to [supported features](../../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
Refer to [feature guide](../../user_guide/feature_guide/index.md) to get the feature's configuration.
## Weight Preparation
Download Link for the `Qwen3-Next-80B-A3B-Instruct` Model Weights: [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Next-80B-A3B-Instruct)
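
You can also fetch the weights ahead of time with the ModelScope CLI (a sketch assuming the `modelscope` package is installed; vLLM can equally download them on first launch):

```bash
pip install modelscope
# Download to a local directory; point --model or model= at this path later if preferred
modelscope download --model Qwen/Qwen3-Next-80B-A3B-Instruct --local_dir ./Qwen3-Next-80B-A3B-Instruct
```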
## Deployment
If your machine is an Atlas 800I A3 (64G*16), the deployment steps are identical.
### Run docker container
```{code-block} bash
:substitutions:

# Update the vllm-ascend image
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--shm-size=1g \
--name vllm-ascend-qwen3 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
```
Qwen3-Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. Future versions may bring behavioral changes related to stability, accuracy, and performance improvement.
### Inference
:::::{tab-set}
::::{tab-item} Online Inference
Run the following command to start the vLLM server on multi-NPU:
```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'
```
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [
        {"role": "user", "content": "Who are you?"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "max_completion_tokens": 32
}'
```
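
Alternatively, you can query the server with the OpenAI Python client (a minimal sketch, assuming the `openai` package is installed):

```python
from openai import OpenAI

# The vLLM server exposes an OpenAI-compatible API; the key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    top_p=0.95,
    max_completion_tokens=32,
    extra_body={"top_k": 20},  # top_k is a vLLM extension to the OpenAI API
)
print(response.choices[0].message.content)
```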
::::
::::{tab-item} Offline Inference
Run the following script to execute offline inference on multi-NPU:
```python
import gc

import torch

from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()


if __name__ == '__main__':
    prompts = [
        "Who are you?",
    ]
    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40, max_tokens=32)
    llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct",
              tensor_parallel_size=4,
              enforce_eager=True,
              distributed_executor_backend="mp",
              gpu_memory_utilization=0.7,
              max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()
```
If the script runs successfully, you will see output similar to the following:
```bash
Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I am'
```
::::
:::::
## Accuracy Evaluation
### Using AISBench
1. Refer to [Using AISBench](../../developer_guide/evaluation/using_ais_bench.md) for details.
2. After execution, you can get the result. Here is the result of `Qwen3-Next-80B-A3B-Instruct` on `vllm-ascend:0.13.0rc1`, for reference only.
| dataset | version | metric | mode | vllm-api-general-chat |
| ----- | ----- | ----- | ----- | ----- |
| gsm8k | - | accuracy | gen | 95.53 |

## Performance
### Using AISBench
Refer to [Using AISBench for performance evaluation](../../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.
### Using vLLM Benchmark
Here we run a performance evaluation of `Qwen3-Next` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.
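
For instance, the two offline subcommands can be invoked as follows (a sketch; the flags mirror vLLM's benchmark scripts, and the values are placeholders to adapt to your workload):

```shell
# Latency of a single batch
vllm bench latency --model Qwen/Qwen3-Next-80B-A3B-Instruct --input-len 128 --output-len 128 --batch-size 8

# Offline throughput
vllm bench throughput --model Qwen/Qwen3-Next-80B-A3B-Instruct --input-len 200 --output-len 200 --num-prompts 200
```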
Take `serve` as an example and run the following command.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After a few minutes, you can get the performance evaluation result.
The performance result is:
**Hardware**: A3-752T, 2 nodes

**Deployment**: TP4 + Full Decode Only

**Input/Output**: 2k/2k

**Concurrency**: 32

**Performance**: 580 tps, TPOT 54 ms