From 503822c56c99867c6d8d7da9663f40baf67d744e Mon Sep 17 00:00:00 2001
From: meihanc
Date: Thu, 8 Jan 2026 16:57:20 +0800
Subject: [PATCH] [Doc] Add Qwen3-Omni-30B-A3B-Thinking Tutorials (#3991)

### What this PR does / why we need it?
Add Qwen3-Omni-30B-A3B-Thinking Tutorials

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42

---------

Signed-off-by: Meihan-chen
---
 .../tutorials/Qwen3-Omni-30B-A3B-Thinking.md | 311 ++++++++++++++++++
 docs/source/tutorials/index.md               |   1 +
 .../support_matrix/supported_models.md       |   3 +-
 3 files changed, 314 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md

diff --git a/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md
new file mode 100644
index 00000000..e63e9221
--- /dev/null
+++ b/docs/source/tutorials/Qwen3-Omni-30B-A3B-Thinking.md
@@ -0,0 +1,311 @@
# Qwen3-Omni-30B-A3B-Thinking

## Introduction

Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model. It processes text, images, audio, and video, and delivers real-time streaming responses in both text and natural speech, with several architectural upgrades that improve performance and efficiency. Qwen3-Omni-30B-A3B-Thinking is the Thinking variant of Qwen3-Omni-30B-A3B: it contains the thinker component, is equipped with chain-of-thought reasoning, and supports audio, video, and text input with text output.

This document describes the main verification steps for the model, including supported features, feature configuration, environment preparation, single-node deployment, and accuracy and performance evaluation.

## Supported Features

Refer to [supported features](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/support_matrix/supported_models.html) for the model's supported feature matrix.

Refer to [feature guide](https://docs.vllm.ai/projects/ascend/zh-cn/latest/user_guide/feature_guide/index.html) for each feature's configuration.

## Environment Preparation

### Model Weight

- `Qwen3-Omni-30B-A3B-Thinking` requires 2 NPU cards (64 GB × 2). [Download the model weight](https://modelscope.cn/models/Qwen/Qwen3-Omni-30B-A3B-Thinking).
It is recommended to download the model weight to a directory shared across nodes, such as `/root/.cache/`.

### Installation

:::::{tab-set}
::::{tab-item} Use docker image

You can use our official docker image to run Qwen3-Omni-30B-A3B-Thinking directly.

Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker).
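Before starting the container, you can optionally confirm that the Ascend driver and NPU cards are visible on the host. This is only a quick sanity check; `npu-smi` ships with the Ascend driver.

```bash
# List the NPU devices and their health/utilization on the host.
npu-smi info
```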
```{code-block} bash
  :substitutions:
# Update --device according to your device (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15]).
# Update the vllm-ascend image according to your environment.
# Note: you should download the weight to /root/.cache in advance.
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
export NAME=vllm-ascend

# Run the container using the defined variables.
# Note: If you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance.
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash
```

::::
::::{tab-item} Build from source

You can build everything from source.

- Install `vllm-ascend`, refer to [set up using python](../installation.md#set-up-using-python).

::::
:::::

Please install the following dependencies:

```bash
pip install qwen_omni_utils modelscope
# ffmpeg is used for audio processing.
apt-get update && apt-get install ffmpeg -y
# Check the installation.
ffmpeg -version
```

## Deployment

### Single-node Deployment

#### Offline Inference on Multi-NPU

Run the following script to execute offline inference on multi-NPU:

```python
import gc
import os

import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (
    destroy_distributed_environment,
    destroy_model_parallel
)
from modelscope import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

# Enlarge the HCCL buffer used for cross-device communication.
os.environ["HCCL_BUFFSIZE"] = "1024"


def clean_up():
    """Clean up distributed resources and NPU memory."""
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()  # Garbage collection to free up memory.
    torch.npu.empty_cache()


def main():
    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=2,
        enable_expert_parallel=True,
        distributed_executor_backend="mp",
        limit_mm_per_prompt={'image': 5, 'video': 2, 'audio': 3},
        max_model_len=32768,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."}
            ]
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    # 'use_audio_in_video=True' requires an equal number of audio and video items, counting the audio track extracted from the video.
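    # process_mm_info (from qwen_omni_utils) extracts the audio, image, and video inputs
    # from `messages`; each returned value is None when that modality is absent.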
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        "prompt": text,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": True}
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    del llm
    clean_up()


if __name__ == "__main__":
    main()
```

#### Online Inference on Multi-NPU

Run the following command to start the vLLM server on multi-NPU. For an Atlas A2 with 64 GB of memory per NPU card, `tensor-parallel-size` should be at least 1; with 32 GB per card, it should be at least 2.

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Thinking --tensor-parallel-size 2 --enable_expert_parallel
```

## Functional Verification

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "Qwen/Qwen3-Omni-30B-A3B-Thinking",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"
                    }
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"
                    }
                },
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this audio, image, and video together."
                }
            ]
        }
    ]
}'
```
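Alternatively, you can query the server with the OpenAI-compatible Python client. The snippet below is a minimal sketch: it assumes the `openai` package is installed and that the server is reachable at the default `http://localhost:8000/v1`.

```python
from openai import OpenAI

# The vLLM server exposes an OpenAI-compatible API; the api_key is required by the
# client but is not checked by a server started without --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni-30B-A3B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"}},
                {"type": "text", "text": "What can you see and hear? Answer in one sentence."},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```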
## Accuracy Evaluation

Here are the accuracy evaluation methods.

### Using EvalScope

As an example, take the `gsm8k`, `omni_bench`, and `bbh` datasets as test datasets, and run accuracy evaluation of `Qwen3-Omni-30B-A3B-Thinking` in online mode.

1. Refer to [Using EvalScope](https://docs.vllm.ai/projects/ascend/en/latest/developer_guide/evaluation/using_evalscope.html#install-evalscope-using-pip) for `evalscope` installation.
2. Run `evalscope` to execute the accuracy evaluation.

```bash
evalscope eval \
  --model /root/.cache/modelscope/hub/models/Qwen/Qwen3-Omni-30B-A3B-Thinking \
  --api-url http://localhost:8000/v1 \
  --api-key EMPTY \
  --eval-type server \
  --datasets omni_bench gsm8k bbh \
  --dataset-args '{"omni_bench": {"extra_params": {"use_image": true, "use_audio": false}}}' \
  --eval-batch-size 1 \
  --generation-config '{"max_tokens": 10000, "temperature": 0.6}' \
  --limit 100
```

3. After execution, you can get the result. The following is the result of `Qwen3-Omni-30B-A3B-Thinking` on vllm-ascend v0.13.0rc1, for reference only.

```bash
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Model                       | Dataset    | Metric   | Subset   | Num   | Score   | Cat.0   |
+=============================+============+==========+==========+=======+=========+=========+
| Qwen3-Omni-30B-A3B-Thinking | omni_bench | mean_acc | default  | 100   | 0.44    | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | gsm8k      | mean_acc | main     | 100   | 0.98    | default |
+-----------------------------+------------+----------+----------+-------+---------+---------+
| Qwen3-Omni-30B-A3B-Thinking | bbh        | mean_acc | OVERALL  | 270   | 0.9148  |         |
+-----------------------------+------------+----------+----------+-------+---------+---------+
```

## Performance

### Using vLLM Benchmark

Run the performance evaluation of `Qwen3-Omni-30B-A3B-Thinking` as an example.
Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:

- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example and run the commands as follows.

```bash
export VLLM_USE_MODELSCOPE=True
export MODEL=Qwen/Qwen3-Omni-30B-A3B-Thinking

# Start the server with dummy weights in one terminal.
python3 -m vllm.entrypoints.openai.api_server --model $MODEL --tensor-parallel-size 2 --swap-space 16 --disable-log-stats --disable-log-request --load-format dummy

# In another terminal, install the benchmark dependencies and run the benchmark.
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r vllm-ascend/benchmarks/requirements-bench.txt

vllm bench serve --model $MODEL --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```

After execution, you can get the result. The following is the result of `Qwen3-Omni-30B-A3B-Thinking` on vllm-ascend v0.13.0rc1, for reference only.

```bash
============ Serving Benchmark Result ============
Successful requests:                     200
Failed requests:                         0
Request rate configured (RPS):           1.00
Benchmark duration (s):                  211.90
Total input tokens:                      40000
Total generated tokens:                  25600
Request throughput (req/s):              0.94
Output token throughput (tok/s):         120.81
Peak output token throughput (tok/s):    216.00
Peak concurrent requests:                24.00
Total token throughput (tok/s):          309.58
---------------Time to First Token----------------
Mean TTFT (ms):                          215.50
Median TTFT (ms):                        211.51
P99 TTFT (ms):                           317.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          98.96
Median TPOT (ms):                        99.19
P99 TPOT (ms):                           101.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           99.02
Median ITL (ms):                         96.10
P99 ITL (ms):                            176.02
==================================================
```
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index 3c7979a7..5b6d63f3 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -16,6 +16,7 @@ Qwen3_reranker
 Qwen3-8B-W4A8
 Qwen3-32B-W4A4
 Qwen3-Next
+Qwen3-Omni-30B-A3B-Thinking.md
 DeepSeek-V3.1.md
 DeepSeek-V3.2.md
 DeepSeek-R1.md
diff --git a/docs/source/user_guide/support_matrix/supported_models.md b/docs/source/user_guide/support_matrix/supported_models.md
index f021c718..3821e113 100644
--- a/docs/source/user_guide/support_matrix/supported_models.md
+++ b/docs/source/user_guide/support_matrix/supported_models.md
@@ -62,7 +62,8 @@ Get the latest info here: https://github.com/vllm-project/vllm-ascend/issues/160
 | Qwen2-VL | ✅ | || A2/A3 |||||||||||||||||
 | Qwen2.5-VL | ✅ | | ✅ | A2/A3 | ✅ | ✅ | ✅ ||| ✅ | ✅ |||| ✅ | ✅ | ✅ | 30k || [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) |
 | Qwen3-VL | ✅ | ||A2/A3|||||||✅|||||✅|✅||| [Qwen-VL-Dense](../../tutorials/Qwen-VL-Dense.md) |
-| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-235B-A22B-Instruct](../../tutorials/Qwen3-VL-235B-A22B-Instruct.md)|
+| Qwen3-VL-MOE | ✅ | | ✅ | A2/A3||✅|✅|||✅|✅|✅|✅|✅|✅|✅|✅|256k||[Qwen3-VL-MOE](../../tutorials/Qwen3-VL-235B-A22B-Instruct.md)|
+| Qwen3-Omni-30B-A3B-Thinking | ✅ | ||A2/A3|||||||✅||✅|||||||[Qwen3-Omni-30B-A3B-Thinking](../../tutorials/Qwen3-Omni-30B-A3B-Thinking.md)|
 | Qwen2.5-Omni | ✅ | || A2/A3 |||||||||||||||| [Qwen2.5-Omni](../../tutorials/Qwen2.5-Omni.md) |
 | Qwen3-Omni | ✅ | || A2/A3 |||||||||||||||||
 | QVQ | ✅ | || A2/A3 |||||||||||||||||