[Doc] Add Qwen3-Omni-30B-A3B-Thinking Tutorials (#3991)
### What this PR does / why we need it?
Add Qwen3-Omni-30B-A3B-Thinking Tutorials
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
311
docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md
Normal file
@@ -0,0 +1,311 @@

# Qwen3-Omni-30B-A3B-Thinking

## Introduction

Qwen3-Omni is a family of natively end-to-end, multilingual, omni-modal foundation models. The models process text, images, audio, and video, and deliver real-time streaming responses in both text and natural speech, with several architectural upgrades that improve performance and efficiency. Qwen3-Omni-30B-A3B-Thinking is the Thinking model of Qwen3-Omni-30B-A3B: it contains the thinker component, is equipped with chain-of-thought reasoning, and supports audio, video, and text input with text output.

This document shows the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.

## Supported Features

Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) to get the model's supported feature matrix.

Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) to get the feature's configuration.

## Environment Preparation

### Model Weight

- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64 GB × 2). [Download model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking)

It is recommended to download the model weight to a directory shared by multiple nodes, such as `/root/.cache/`.

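As a rough sanity check on the two-card requirement above, the weight footprint can be estimated from the parameter count. The sketch below assumes roughly 30 billion parameters (read off the model name) stored at 2 bytes each and split evenly across cards. These are assumptions rather than measured values, and the estimate ignores the KV cache and activations.

```python
# Back-of-the-envelope estimate of weight memory per NPU card.
# ASSUMPTIONS (not from the tutorial): ~30e9 parameters, 2 bytes per
# parameter (16-bit weights), weights split evenly across cards.


def weight_gib_per_card(num_params: float, bytes_per_param: int, num_cards: int) -> float:
    """Return the approximate weight footprint per card in GiB."""
    if num_cards < 1:
        raise ValueError("num_cards must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / num_cards / 2**30


if __name__ == "__main__":
    for cards in (1, 2):
        gib = weight_gib_per_card(30e9, 2, cards)
        print(f"{cards} card(s): ~{gib:.1f} GiB of weights per card")
```

Under these assumptions a single 64 GB card would be mostly filled by the weights alone, which is consistent with the recommendation to use two cards.
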
### Installation

:::::{tab-set}
::::{tab-item} Use docker image

You can use our official Docker image to run Qwen3-Omni-30B-A3B-Thinking directly.

Select an image based on your machine type and start the container on your node. For details, refer to [using docker](../installation.md#set-up-using-docker).

```{code-block} bash
   :substitutions:
# Update --device according to your device (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note: download the weights to /root/.cache in advance.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables.
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance.
docker run --rm \
    --name $NAME \
    --net=host \
    --shm-size=1g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -it $IMAGE bash
```

::::
::::{tab-item} Build from source

You can also build everything from source.

- Install `vllm-ascend`, refer to [set up using python](../installation.md#set-up-using-python).

::::
:::::

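The `--device` flags in the Docker command above depend on which NPU cards you pass into the container. A minimal sketch for generating them from a list of card IDs is shown below. It assumes cards are exposed as `/dev/davinci<N>` as in that command; `npu_device_flags` is a hypothetical helper and not part of vllm-ascend.

```python
# Build the `--device` arguments for `docker run` from a list of NPU card IDs.
# ASSUMPTION: cards are exposed as /dev/davinci<N>, as in the Docker command
# of this tutorial. `npu_device_flags` is a hypothetical helper.
import shlex

# Management devices that the Docker command always passes through.
COMMON_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]


def npu_device_flags(card_ids: list[int]) -> list[str]:
    """Return docker `--device` arguments for the given NPU card IDs."""
    if not card_ids:
        raise ValueError("at least one NPU card ID is required")
    # De-duplicate and sort so the output is stable.
    devices = [f"/dev/davinci{i}" for i in sorted(set(card_ids))] + COMMON_DEVICES
    flags: list[str] = []
    for dev in devices:
        flags += ["--device", dev]
    return flags


if __name__ == "__main__":
    # Two cards, matching the 2-NPU requirement of this tutorial.
    print(shlex.join(npu_device_flags([0, 1])))
```

The printed string can be pasted into the `docker run` command in place of the hand-written `--device` lines.
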
Install the additional Python packages and system dependencies:

```bash
pip install qwen_omni_utils modelscope
# ffmpeg is used for audio processing.
apt-get update && apt-get install ffmpeg -y
# Check the installation.
ffmpeg -version
```

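If the inference script is launched from Python tooling, the same `ffmpeg` check can be done programmatically before any model is loaded. The following is a small sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name used for illustration.

```python
# Check that the ffmpeg binary is on PATH before running inference.
# `ffmpeg_available` is a hypothetical helper name used only for illustration.
import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH and runs."""
    path = shutil.which("ffmpeg")
    if path is None:
        return False
    # `ffmpeg -version` exits with status 0 on a working installation.
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg missing: install it with apt-get as shown above")
```
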
## Deployment

### Single-node Deployment

#### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel,
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

os.environ["HCCL_BUFFSIZE"] = "1024"


def clean_up():
    """Clean up distributed resources and NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory
    torch.npu.empty_cache()


def main():
    # Model ID as used in the other commands of this tutorial.
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]
        }
    ]

    # Render the chat messages into a single prompt string.
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # 'use_audio_in_video = True' requires equal number of audio and video items, including audio from the video.
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": True}
    }
    # Only attach the modalities that are actually present.
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
```

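The request dictionary in the script above only includes the modalities that `process_mm_info` actually returned. That assembly step can be isolated into a small helper, sketched below with plain Python objects standing in for decoded media. `build_inputs` is a hypothetical name and is not part of vLLM or `qwen_omni_utils`.

```python
# Assemble the per-request input dict used with `llm.generate`.
# `build_inputs` is a hypothetical helper; the dict layout follows the
# offline inference script in this tutorial.
from typing import Any, Optional


def build_inputs(
    prompt: str,
    audios: Optional[list] = None,
    images: Optional[list] = None,
    videos: Optional[list] = None,
    use_audio_in_video: bool = True,
) -> dict[str, Any]:
    """Return a request dict that contains only the modalities provided."""
    multi_modal_data = {}
    # Skip absent modalities (None), mirroring the `is not None` checks.
    for key, items in (("image", images), ("video", videos), ("audio", audios)):
        if items is not None:
            multi_modal_data[key] = items
    return {
        "prompt": prompt,
        "multi_modal_data": multi_modal_data,
        "mm_processor_kwargs": {"use_audio_in_video": use_audio_in_video},
    }


if __name__ == "__main__":
    # Placeholder strings stand in for decoded video frames and audio samples.
    request = build_inputs("<prompt text>", audios=["audio-0"], videos=["video-0"])
    print(sorted(request["multi_modal_data"]))
```

Keeping the same `use_audio_in_video` value in both `process_mm_info` and `mm_processor_kwargs`, as the script does, avoids a mismatch between the extracted audio tracks and what the processor expects.
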
#### Online Inference on Multi-NPU

Run the following command to start the vLLM server on multi-NPU.

For an Atlas A2 with 64 GB of NPU card memory, `--tensor-parallel-size` should be at least 1; with 32 GB of memory per card, it should be at least 2.

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable-expert-parallel
```

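Model loading can take a while, so it can help to wait until the server answers before sending requests. The sketch below polls the server's `/health` endpoint with the standard library. Port 8000 is taken from the request example in this tutorial, while `wait_until_ready`, the timeout values, and the injectable `probe` argument are illustrative assumptions.

```python
# Poll the vLLM server until it reports healthy.
# ASSUMPTIONS: the server listens on localhost:8000; the helper names and
# timeout defaults below are illustrative, not part of vLLM.
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Call `probe(url)` repeatedly until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        # Stop if another sleep would take us past the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` with the defaults blocks for up to ten minutes and returns `True` as soon as the server responds. The `probe` parameter exists so the loop can be tested without a running server.
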
## Functional Verification

Once your server is started, you can query the model with input prompts.

```bash
curl http://localhost:8000/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                        }
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                        }
                    },
                    {
                        "type": "video_url",
                        "video_url": {
                            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                        }
                    },
                    {
                        "type": "text",
                        "text": "Analyze this audio, image, and video together."
                    }
                ]
            }
        ]
    }'
```

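The same request can be issued from Python. The sketch below builds the JSON body used in the `curl` example and posts it with the standard library. The endpoint, model name, and media URLs are taken from this tutorial, while `build_payload` and `post_chat` are hypothetical helper names.

```python
# Build and send a multimodal chat request with the standard library only.
# `build_payload` and `post_chat` are hypothetical helpers; the payload
# layout mirrors the curl example in this tutorial.
import json
import urllib.request
from typing import Optional

DEMO_BASE = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo"


def build_payload(
    model: str,
    text: str,
    image: Optional[str] = None,
    audio: Optional[str] = None,
    video: Optional[str] = None,
) -> dict:
    """Build a chat completions request body with optional media URLs."""
    content = []
    for kind, url in (("image_url", image), ("audio_url", audio), ("video_url", video)):
        if url is not None:
            content.append({"type": kind, kind: {"url": url}})
    content.append({"type": "text", "text": text})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def post_chat(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to the chat completions endpoint and decode the reply."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_payload(
        "Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "Analyze this audio, image, and video together.",
        image=f"{DEMO_BASE}/cars.jpg",
        audio=f"{DEMO_BASE}/cough.wav",
        video=f"{DEMO_BASE}/draw.mp4",
    )
    # Printing the body needs no server; call post_chat(payload) once it is up.
    print(json.dumps(payload, indent=2))
```

With the server running, `post_chat(payload)["choices"][0]["message"]["content"]` holds the generated text.
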
## Accuracy Evaluation

Here are the accuracy evaluation methods.

### Using EvalScope

As an example, take the `omni_bench`, `gsm8k`, and `bbh` datasets as test datasets, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.

1. Refer to [Using evalscope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.

```bash
evalscope eval \
    --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
    --api-url http://localhost:8000/v1 \
    --api-key EMPTY \
    --eval-type server \
    --datasets omni_bench gsm8k bbh \
    --dataset-args '{"omni_bench": { "extra_params": { "use_image": true, "use_audio": false}}}' \
    --eval-batch-size 1 \
    --generation-config '{"max_tokens": 10000, "temperature": 0.6}' \
    --limit 100
```

3. After execution, you can get the results. The following results of `Qwen3-Omni-30B-A3B-Thinking` in `vllm-ascend:0.13.0rc1` are for reference only.

```bash
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model                       | Dataset    | Metric   | Subset   |   Num |   Score | Cat.0   |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default  |   100 |    0.44 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k      | mean_acc | main     |   100 |    0.98 | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh        | mean_acc | OVERALL  |   270 |  0.9148 |         |
+-----------------------------+------------+----------+----------+-------+---------+---------+
```

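The JSON-valued options of the `evalscope` command in step 2 are easy to break with shell quoting. One way to avoid that is to build the argument list in Python and let `json.dumps` produce the JSON strings, as in the sketch below. The flag names and values are copied from the command in this tutorial. `evalscope_command` is a hypothetical helper, and passing each dataset as a separate argument is an assumption about how the CLI parses `--datasets`.

```python
# Assemble the evalscope command line with JSON options generated by json.dumps.
# `evalscope_command` is a hypothetical helper; flags and values are copied
# from the command in this tutorial. ASSUMPTION: datasets are passed as
# separate arguments after --datasets.
import json
import shlex

DATASET_ARGS = {"omni_bench": {"extra_params": {"use_image": True, "use_audio": False}}}
GENERATION_CONFIG = {"max_tokens": 10000, "temperature": 0.6}


def evalscope_command(model: str, api_url: str, datasets: list[str], limit: int) -> list[str]:
    """Return the evalscope invocation as an argument list."""
    return [
        "evalscope", "eval",
        "--model", model,
        "--api-url", api_url,
        "--api-key", "EMPTY",
        "--eval-type", "server",
        "--datasets", *datasets,
        "--dataset-args", json.dumps(DATASET_ARGS),
        "--eval-batch-size", "1",
        "--generation-config", json.dumps(GENERATION_CONFIG),
        "--limit", str(limit),
    ]


if __name__ == "__main__":
    cmd = evalscope_command(
        "/root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking",
        "http://localhost:8000/v1",
        ["omni_bench", "gsm8k", "bbh"],
        100,
    )
    # shlex.join quotes the JSON arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list form can also be passed directly to `subprocess.run`, which sidesteps shell quoting altogether.
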
## Performance

### Using vLLM Benchmark

Run performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the commands as follows.

```bash
export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking
# Start the server; it keeps running in the foreground.
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy

# Run the remaining commands in a second shell (export MODEL there as well).
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt

vllm bench serve --model $MODEL --dataset-name random --random-input-len 200 --num-prompts 200 --request-rate 1 --save-result --result-dir ./
```

After execution, you can get the results. The following results of `Qwen3-Omni-30B-A3B-Thinking` in `vllm-ascend:0.13.0rc1` are for reference only.

```bash
============ Serving Benchmark Result ============
Successful requests:                     200
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  211.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              0.94
Output token throughput (tok/s):         120.81
Peak output token throughput (tok/s):    216.00
Peak concurrent requests:                24.00
Total token throughput (tok/s):          309.58
---------------Time to First Token----------------
Mean TTFT (ms):                          215.50
Median TTFT (ms):                        211.51
P99 TTFT (ms):                           317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          98.96
Median TPOT (ms):                        99.19
P99 TPOT (ms):                           101.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           99.02
Median ITL (ms):                         96.10
P99 ITL (ms):                            176.02
==================================================
```

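The throughput lines in the report above follow directly from the raw counters, which makes it easy to sanity-check a run. The short sketch below recomputes them from the numbers printed in this tutorial. The variable names are illustrative, and small differences are possible because the reported duration is rounded.

```python
# Recompute derived metrics from the raw counters in the benchmark report.
# All input numbers are copied from the report in this tutorial.
duration_s = 211.90
num_requests = 200
input_tokens = 40000
generated_tokens = 25600
mean_tpot_ms = 98.96

request_throughput = num_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s
# Average number of generated tokens per request.
tokens_per_request = generated_tokens / num_requests
# Decode speed seen by a single request, derived from the mean TPOT.
per_request_decode_rate = 1000.0 / mean_tpot_ms

print(f"Request throughput (req/s):       {request_throughput:.2f}")
print(f"Output token throughput (tok/s):  {output_throughput:.2f}")
print(f"Total token throughput (tok/s):   {total_throughput:.2f}")
print(f"Generated tokens per request:     {tokens_per_request:.0f}")
print(f"Per-request decode rate (tok/s):  {per_request_decode_rate:.2f}")
```

The aggregate output throughput is well above the per-request decode rate because several requests are decoded concurrently.
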
@@ -16,6 +16,7 @@ Qwen3_reranker
Qwen3-8B-W4A8
Qwen3-32B-W4A4
Qwen3-Next
Qwen3-Omni-30B-A3B-Thinking.md
DeepSeek-V3.1.md
DeepSeek-V3.2.md
DeepSeek-R1.md

@@ -62,7 +62,8 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| Qwen2-VL | ✅ | || A2/A3 |||||||||||||||||
| Qwen2.5-VL | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ |||| ✅ | ✅ | ✅ | 30k || [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) |
| Qwen3-VL | ✅ | ||A2/A3|||||||✅|||||✅|✅||| [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) |
| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-235B-A22B-Instruct](../../tutorials/Qwen3-VL-235B-A22B-Instruct.md)|
| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-MOE](../../tutorials/Qwen3-VL-235B-A22B-Instruct.md)|
| Qwen3-Omni-30B-A3B-Thinking | ✅ | ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/Qwen3-Omni-30B-A3B-Thinking.md)|
| Qwen2.5-Omni | ✅ | || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/Qwen2.5-Omni.md) |
| Qwen3-Omni | ✅ | || A2/A3 |||||||||||||||||
| QVQ | ✅ | || A2/A3 |||||||||||||||||