# Qwen3-Omni-30B-A3B-Thinking
## Introduction
Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech. Several architectural upgrades improve its performance and efficiency. Qwen3-Omni-30B-A3B-Thinking is the Thinking variant of Qwen3-Omni-30B-A3B: it contains the thinker component, is equipped with chain-of-thought reasoning, and supports audio, video, and text input with text output.
This document walks through the main verification steps for the model: supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.
## Supported Features
Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) for the model's supported feature matrix.
Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) for each feature's configuration.
## Environment Preparation
### Model Weight
- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64 GB × 2). [Download the model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking).
It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.
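For example, you can fetch the weights with ModelScope's Python API. This is a minimal sketch; adjust `cache_dir` to your environment:

```python
# Minimal sketch: download the weights into the shared cache directory
# so that every node (and the container mount below) can see them.
from modelscope import snapshot_download

model_dir = snapshot_download(
    "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    cache_dir="/root/.cache/modelscope",  # shared path mounted into the container
)
print(f"Weights downloaded to: {model_dir}")
```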
### Installation
:::::{tab-set}
::::{tab-item} Use docker image
You can use our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly.

Select an image based on your machine type and start the container on your node; refer to [using docker](../../installation.md#set-up-using-docker).
```{code-block} bash
:substitutions:
# Update --device according to your device (Atlas A2: /dev/davinci[0-7]; Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note: download the model weight to /root/.cache in advance.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend
# Run the container using the defined variables
# Note: if you are running Docker with a bridge network, expose the ports needed for multi-node communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```
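Once inside the container, you can optionally verify that both NPUs are visible. This is a minimal sketch assuming the image ships `torch` and `torch_npu` preinstalled:

```python
# Run inside the container: confirm the NPU backend is registered and
# that both devices passed via --device are visible.
import torch
import torch_npu  # noqa: F401  # registers the "npu" device with torch

print("NPU available:", torch.npu.is_available())
print("NPU count:", torch.npu.device_count())
```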
::::
::::{tab-item} Build from source
You can build everything from source.

- Install `vllm-ascend`; refer to [set up using python](../../installation.md#set-up-using-python).
::::
:::::
Then install the required Python packages and system dependencies:
```bash
pip install qwen_omni_utils modelscope
# Used for audio processing.
apt-get update && apt-get install -y ffmpeg
# Check the installation.
ffmpeg -version
```
## Deployment
### Single-node Deployment
#### Offline Inference on Multi-NPU
Run the following script to execute offline inference on multi-NPU:
```python
import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

os.environ["HCCL_BUFFSIZE"] = "1024"


def clean_up():
    """Clean up distributed resources and NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free host memory.
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={"image": 5, "video": 2, "audio": 3},
        max_model_len=32768,
    )
    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )
    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # `use_audio_in_video=True` requires an equal number of audio and video
    # items, counting the audio track extracted from the video.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": True},
    }
    if images is not None:
        inputs["multi_modal_data"]["image"] = images
    if videos is not None:
        inputs["multi_modal_data"]["video"] = videos
    if audios is not None:
        inputs["multi_modal_data"]["audio"] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
```
#### Online Inference on Multi-NPU
Run the following commands to start the vLLM server on multiple NPUs. On Atlas A2, `--tensor-parallel-size` should be at least 1 for a 64 GB NPU card and at least 2 for a 32 GB card.

First, export the recommended environment variables:
```bash
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE=AIV
```
Then start the server:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable-expert-parallel
```
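Weight loading can take several minutes. The following minimal readiness probe (a sketch assuming the default port 8000) polls the OpenAI-compatible endpoint until the server responds:

```python
# Poll the /v1/models endpoint until the server above is ready.
import time
import urllib.request

URL = "http://localhost:8000/v1/models"
for _ in range(120):
    try:
        with urllib.request.urlopen(URL) as resp:
            print(resp.read().decode())
            break
    except OSError:
        time.sleep(5)  # still loading weights; retry
```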
## Functional Verification
Once your server is started, you can query the model with input prompts.
```bash
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                    }
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                    }
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this audio, image, and video together."
                }
            ]
        }
    ]
}'
```
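Equivalently, you can query the server from Python with the OpenAI-compatible client. This is a minimal sketch assuming `pip install openai`; the `video_url` content type is vLLM's multimodal extension of the OpenAI chat schema:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}},
            {"type": "text", "text": "What can you see and hear? Answer in one sentence."},
        ],
    }],
)
print(response.choices[0].message.content)
```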
## Accuracy Evaluation
This section describes how to evaluate the model's accuracy.
### Using EvalScope
As an example, take the `gsm8k`, `omni_bench`, and `bbh` datasets as test datasets, and run an accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.
1. Refer to [Using EvalScope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation. (An equivalent Python-API sketch follows the results below.)
```bash
evalscope eval \
  --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
  --api-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --eval-type server \
  --datasets omni_bench gsm8k bbh \
  --dataset-args '{"omni_bench": {"extra_params": {"use_image": true, "use_audio": false}}}' \
  --eval-batch-size 1 \
  --generation-config '{"max_completion_tokens": 10000, "temperature": 0.6}' \
  --limit 100
```
3. After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` on vllm-ascend:0.13.0rc1, for reference only.
```bash
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default | 100 | 0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k | mean_acc | main | 100 | 0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh | mean_acc | OVERALL | 270 | 0.9148 | |
+-----------------------------+------------+----------+----------+-------+---------+---------+
```
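If you prefer driving the evaluation from Python, EvalScope also exposes a task API. The following sketch mirrors the CLI flags in step 2; the exact `TaskConfig` field names are an assumption to verify against your installed EvalScope version:

```python
from evalscope import TaskConfig, run_task

# Mirrors the CLI invocation in step 2; adjust paths and URLs for your setup.
task = TaskConfig(
    model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking",
    api_url="http://localhost:8000/v1",
    api_key="EMPTY",
    eval_type="server",
    datasets=["gsm8k"],
    generation_config={"max_completion_tokens": 10000, "temperature": 0.6},
    limit=100,
)
run_task(task)
```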
## Performance
### Using vLLM Benchmark
Run a performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.
There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the following commands:
```bash
export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
# Start the server (keep it running in one terminal).
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-requests --load-format dummy
# In another terminal, install the benchmark dependencies and run the benchmark.
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt
vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```
After execution, you can get the result. Here is the result of `Qwen3-Omni-30B-A3B-Thinking` on vllm-ascend:0.13.0rc1, for reference only.
```bash
============ Serving Benchmark Result ============
Successful requests: 200
Failed requests: 0
Request rate configured (RPS): 1.00
Benchmark duration (s): 211.90
Total input tokens: 40000
Total generated tokens: 25600
Request throughput (req/s): 0.94
Output token throughput (tok/s): 120.81
Peak output token throughput (tok/s): 216.00
Peak concurrent requests: 24.00
Total token throughput (tok/s): 309.58
---------------Time to First Token----------------
Mean TTFT (ms): 215.50
Median TTFT (ms): 211.51
P99 TTFT (ms): 317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 98.96
Median TPOT (ms): 99.19
P99 TPOT (ms): 101.52
---------------Inter-token Latency----------------
Mean ITL (ms): 99.02
Median ITL (ms): 96.10
P99 ITL (ms): 176.02
==================================================
```
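Since the run above passed `--save-result`, the metrics are also written to a JSON file in `--result-dir`. The following minimal sketch pulls out the headline numbers; the timestamped file name pattern and exact key names are assumptions based on vLLM's serve benchmark output:

```python
import glob
import json
import os

# Grab the newest result JSON written by --save-result in --result-dir ./
result_file = max(glob.glob("./*.json"), key=os.path.getmtime)
with open(result_file) as f:
    result = json.load(f)

# Print a few headline metrics; .get() tolerates missing/renamed keys.
for key in ("request_throughput", "output_throughput", "mean_ttft_ms", "mean_tpot_ms"):
    print(f"{key}: {result.get(key)}")
```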